CUDA C++ Programming Guide

The programming guide to the CUDA model and interface.

Changes from Version 12.4

  • Added section Asynchronous Data Copies using Tensor Memory Access (TMA).

  • Added Unified Memory Programming guide supporting Grace Hopper with Address Translation Service (ATS) and Heterogeneous Memory Management (HMM) on x86.

1. Introduction

1.1. The Benefits of Using GPUs

The Graphics Processing Unit (GPU)1 provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU (see GPU Applications). Other computing devices, like FPGAs, are also very energy efficient, but offer much less programming flexibility than GPUs.

This difference in capabilities between the GPU and the CPU exists because they are designed with different goals in mind. While the CPU is designed to excel at executing a sequence of operations, called a thread, as fast as possible and can execute a few tens of these threads in parallel, the GPU is designed to excel at executing thousands of them in parallel (amortizing the slower single-thread performance to achieve greater throughput).

The GPU is specialized for highly parallel computations and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. The schematic Figure 1 shows an example distribution of chip resources for a CPU versus a GPU.


Figure 1 The GPU Devotes More Transistors to Data Processing

Devoting more transistors to data processing, for example, floating-point computations, is beneficial for highly parallel computations; the GPU can hide memory access latencies with computation, instead of relying on large data caches and complex flow control to avoid long memory access latencies, both of which are expensive in terms of transistors.

In general, an application has a mix of parallel parts and sequential parts, so systems are designed with a mix of GPUs and CPUs in order to maximize overall performance. Applications with a high degree of parallelism can exploit this massively parallel nature of the GPU to achieve higher performance than on the CPU.

1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model

In November 2006, NVIDIA® introduced CUDA®, a general purpose parallel computing platform and programming model that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C++ as a high-level programming language. As illustrated by Figure 2, other languages, application programming interfaces, or directives-based approaches are supported, such as FORTRAN, DirectCompute, and OpenACC.


Figure 2 GPU Computing Applications. CUDA is designed to support various languages and application programming interfaces.

1.3. A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions — a hierarchy of thread groups, shared memories, and barrier synchronization — that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.

This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors as illustrated by Figure 3, and only the runtime system needs to know the physical multiprocessor count.

This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).


Figure 3 Automatic Scalability

Note

A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware Implementation for more details). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.

1.4. Document Structure

This document is organized into the following sections:

  • Introduction is a general introduction to CUDA.

  • Programming Model outlines the CUDA programming model.

  • Programming Interface describes the programming interface.

  • Hardware Implementation describes the hardware implementation.

  • Performance Guidelines gives some guidance on how to achieve maximum performance.

  • CUDA-Enabled GPUs lists all CUDA-enabled devices.

  • C++ Language Extensions is a detailed description of all extensions to the C++ language.

  • Cooperative Groups describes synchronization primitives for various groups of CUDA threads.

  • CUDA Dynamic Parallelism describes how to launch and synchronize one kernel from another.

  • Virtual Memory Management describes how to manage the unified virtual address space.

  • Stream Ordered Memory Allocator describes how applications can order memory allocation and deallocation.

  • Graph Memory Nodes describes how graphs can create and own memory allocations.

  • Mathematical Functions lists the mathematical functions supported in CUDA.

  • C++ Language Support lists the C++ features supported in device code.

  • Texture Fetching gives more details on texture fetching.

  • Compute Capabilities gives the technical specifications of various devices, as well as more architectural details.

  • Driver API introduces the low-level driver API.

  • CUDA Environment Variables lists all the CUDA environment variables.

  • Unified Memory Programming introduces the Unified Memory programming model.

1

The graphics qualifier comes from the fact that when the GPU was originally created, two decades ago, it was designed as a specialized processor to accelerate graphics rendering. Driven by the insatiable market demand for real-time, high-definition, 3D graphics, it has evolved into a general processor used for many more workloads than just graphics rendering.

2. Programming Model

This chapter introduces the main concepts behind the CUDA programming model by outlining how they are exposed in C++.

An extensive description of CUDA C++ is given in Programming Interface.

Full code for the vector addition example used in this chapter and the next can be found in the vectorAdd CUDA sample.

2.1. Kernels

CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions.

A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given kernel call is specified using a new <<<...>>> execution configuration syntax (see C++ Language Extensions). Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through built-in variables.

As an illustration, the following sample code, using the built-in variable threadIdx, adds two vectors A and B of size N and stores the result into vector C:

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}

Here, each of the N threads that execute VecAdd() performs one pair-wise addition.

2.2. Thread Hierarchy

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional block of threads, called a thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx); for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).

As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same streaming multiprocessor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

Blocks are organized into a one-dimensional, two-dimensional, or three-dimensional grid of thread blocks as illustrated by Figure 4. The number of thread blocks in a grid is usually dictated by the size of the data being processed, which typically exceeds the number of processors in the system.


Figure 4 Grid of Thread Blocks

The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. Two-dimensional blocks or grids can be specified as in the example above.

Each block within the grid can be identified by a one-dimensional, two-dimensional, or three-dimensional unique index accessible within the kernel through the built-in blockIdx variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.

Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}

A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 3, enabling programmers to write code that scales with the number of cores.

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Shared Memory gives an example of using shared memory. In addition to __syncthreads(), the Cooperative Groups API provides a rich set of thread-synchronization primitives.

For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.

2.2.1. Thread Block Clusters

With the introduction of NVIDIA Compute Capability 9.0, the CUDA programming model introduces an optional level of hierarchy called Thread Block Clusters that are made up of thread blocks. Similar to how threads in a thread block are guaranteed to be co-scheduled on a streaming multiprocessor, thread blocks in a cluster are also guaranteed to be co-scheduled on a GPU Processing Cluster (GPC) in the GPU.

Similar to thread blocks, clusters are also organized into a one-dimensional, two-dimensional, or three-dimensional arrangement, as illustrated by Figure 5. The number of thread blocks in a cluster can be user-defined, and a maximum of 8 thread blocks in a cluster is supported as a portable cluster size in CUDA. Note that on GPU hardware or MIG configurations which are too small to support 8 multiprocessors, the maximum cluster size will be reduced accordingly. Identification of these smaller configurations, as well as of larger configurations supporting a thread block cluster size beyond 8, is architecture-specific and can be queried using the cudaOccupancyMaxPotentialClusterSize API.


Figure 5 Grid of Thread Block Clusters

Note

In a kernel launched using cluster support, the gridDim variable still denotes the size in terms of number of thread blocks, for compatibility purposes. The rank of a block in a cluster can be found using the Cluster Group API.

A thread block cluster can be enabled in a kernel either at compile time, using the kernel attribute __cluster_dims__(X,Y,Z), or at launch time, using the CUDA kernel launch API cudaLaunchKernelEx. The example below shows how to launch a cluster using the compile-time kernel attribute. A cluster size set via the kernel attribute is fixed at compile time, and the kernel can then be launched using the classical <<< , >>> syntax. If a kernel uses a compile-time cluster size, the cluster size cannot be modified when launching the kernel.

// Kernel definition
// Compile time cluster size 2 in X-dimension and 1 in Y and Z dimension
__global__ void __cluster_dims__(2, 1, 1) cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    // Kernel invocation with compile time cluster size
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension must be a multiple of cluster size.
    cluster_kernel<<<numBlocks, threadsPerBlock>>>(input, output);
}

A thread block cluster size can also be set at runtime and the kernel can be launched using the CUDA kernel launch API cudaLaunchKernelEx. The code example below shows how to launch a cluster kernel using the extensible API.

// Kernel definition
// No compile time attribute attached to the kernel
__global__ void cluster_kernel(float *input, float* output)
{

}

int main()
{
    float *input, *output;
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);

    // Kernel invocation with runtime cluster size
    {
        cudaLaunchConfig_t config = {0};
        // The grid dimension is not affected by cluster launch, and is still enumerated
        // using number of blocks.
        // The grid dimension should be a multiple of cluster size.
        config.gridDim = numBlocks;
        config.blockDim = threadsPerBlock;

        cudaLaunchAttribute attribute[1];
        attribute[0].id = cudaLaunchAttributeClusterDimension;
        attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
        attribute[0].val.clusterDim.y = 1;
        attribute[0].val.clusterDim.z = 1;
        config.attrs = attribute;
        config.numAttrs = 1;

        cudaLaunchKernelEx(&config, cluster_kernel, input, output);
    }
}

In GPUs with compute capability 9.0, all the thread blocks in the cluster are guaranteed to be co-scheduled on a single GPU Processing Cluster (GPC) and allow thread blocks in the cluster to perform hardware-supported synchronization using the Cluster Group API cluster.sync(). Cluster group also provides member functions to query cluster group size in terms of number of threads or number of blocks using num_threads() and num_blocks() API respectively. The rank of a thread or block in the cluster group can be queried using dim_threads() and dim_blocks() API respectively.

Thread blocks that belong to a cluster have access to the Distributed Shared Memory. Thread blocks in a cluster have the ability to read, write, and perform atomics to any address in the distributed shared memory. Distributed Shared Memory gives an example of performing histograms in distributed shared memory.

2.3. Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 6. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. Thread blocks in a thread block cluster can perform read, write, and atomics operations on each other’s shared memory. All threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Device Memory Accesses). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Texture and Surface Memory).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.


Figure 6 Memory Hierarchy

2.4. Heterogeneous Programming

As illustrated by Figure 7, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C++ program. This is the case, for example, when the kernels execute on a GPU and the rest of the C++ program executes on a CPU.

The CUDA programming model also assumes that both the host and the device maintain their own separate memory spaces in DRAM, referred to as host memory and device memory, respectively. Therefore, a program manages the global, constant, and texture memory spaces visible to kernels through calls to the CUDA runtime (described in Programming Interface). This includes device memory allocation and deallocation as well as data transfer between host and device memory.

Unified Memory provides managed memory to bridge the host and device memory spaces. Managed memory is accessible from all CPUs and GPUs in the system as a single, coherent memory image with a common address space. This capability enables oversubscription of device memory and can greatly simplify the task of porting applications by eliminating the need to explicitly mirror data on host and device. See Unified Memory Programming for an introduction to Unified Memory.


Figure 7 Heterogeneous Programming

Note

Serial code executes on the host while parallel code executes on the device.

2.5. Asynchronous SIMT Programming Model

In the CUDA programming model a thread is the lowest level of abstraction for doing a computation or a memory operation. Starting with devices based on the NVIDIA Ampere GPU architecture, the CUDA programming model provides acceleration to memory operations via the asynchronous programming model. The asynchronous programming model defines the behavior of asynchronous operations with respect to CUDA threads.

The asynchronous programming model defines the behavior of Asynchronous Barrier for synchronization between CUDA threads. The model also explains and defines how cuda::memcpy_async can be used to move data asynchronously from global memory while computing in the GPU.

2.5.1. Asynchronous Operations

An asynchronous operation is defined as an operation that is initiated by a CUDA thread and is executed asynchronously as-if by another thread. In a well-formed program one or more CUDA threads synchronize with the asynchronous operation. The CUDA thread that initiated the asynchronous operation is not required to be among the synchronizing threads.

Such an asynchronous thread (an as-if thread) is always associated with the CUDA thread that initiated the asynchronous operation. An asynchronous operation uses a synchronization object to synchronize the completion of the operation. Such a synchronization object can be explicitly managed by a user (e.g., cuda::memcpy_async) or implicitly managed within a library (e.g., cooperative_groups::memcpy_async).

A synchronization object could be a cuda::barrier or a cuda::pipeline. These objects are explained in detail in Asynchronous Barrier and Asynchronous Data Copies using cuda::pipeline. These synchronization objects can be used at different thread scopes. A scope defines the set of threads that may use the synchronization object to synchronize with the asynchronous operation. The following table defines the thread scopes available in CUDA C++ and the threads that can be synchronized with each.

Thread Scope

Description

cuda::thread_scope::thread_scope_thread

Only the CUDA thread which initiated asynchronous operations synchronizes.

cuda::thread_scope::thread_scope_block

All or any CUDA threads within the same thread block as the initiating thread synchronizes.

cuda::thread_scope::thread_scope_device

All or any CUDA threads in the same GPU device as the initiating thread synchronizes.

cuda::thread_scope::thread_scope_system

All or any CUDA or CPU threads in the same system as the initiating thread synchronizes.

These thread scopes are implemented as extensions to standard C++ in the CUDA Standard C++ library.

2.6. Compute Capability

The compute capability of a device is represented by a version number, also sometimes called its “SM version”. This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.

The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.

Devices with the same major revision number are of the same core architecture. The major revision number is 9 for devices based on the NVIDIA Hopper GPU architecture, 8 for devices based on the NVIDIA Ampere GPU architecture, 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, and 3 for devices based on the Kepler architecture.

The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

Turing is the architecture for devices of compute capability 7.5, and is an incremental update based on the Volta architecture.

CUDA-Enabled GPUs lists all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.

Note

The compute capability version of a particular GPU should not be confused with the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation.

The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.

3. Programming Interface

CUDA C++ provides a simple path for users familiar with the C++ programming language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C++ language and a runtime library.

The core language extensions have been introduced in Programming Model. They allow programmers to define a kernel as a C++ function and use some new syntax to specify the grid and block dimension each time the function is called. A complete description of all extensions can be found in C++ Language Extensions. Any source file that contains some of these extensions must be compiled with nvcc as outlined in Compilation with NVCC.

The runtime is introduced in CUDA Runtime. It provides C and C++ functions that execute on the host to allocate and deallocate device memory, transfer data between host memory and device memory, manage systems with multiple devices, etc. A complete description of the runtime can be found in the CUDA reference manual.

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also accessible by the application. The driver API provides an additional level of control by exposing lower-level concepts such as CUDA contexts - the analogue of host processes for the device - and CUDA modules - the analogue of dynamically loaded libraries for the device. Most applications do not use the driver API as they do not need this additional level of control and when using the runtime, context and module management are implicit, resulting in more concise code. As the runtime is interoperable with the driver API, most applications that need some driver API features can default to use the runtime API and only use the driver API where needed. The driver API is introduced in Driver API and fully described in the reference manual.

3.1. Compilation with NVCC

Kernels can be written using the CUDA instruction set architecture, called PTX, which is described in the PTX reference manual. It is however usually more effective to use a high-level programming language such as C++. In both cases, kernels must be compiled into binary code by nvcc to execute on the device.

nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: It provides simple and familiar command line options and executes them by invoking the collection of tools that implement the different compilation stages. This section gives an overview of nvcc workflow and command options. A complete description can be found in the nvcc user manual.

3.1.1. Compilation Workflow

3.1.1.1. Offline Compilation

Source files compiled with nvcc can include a mix of host code (i.e., code that executes on the host) and device code (i.e., code that executes on the device). nvcc’s basic workflow consists of separating device code from host code and then:

  • compiling the device code into an assembly form (PTX code) and/or binary form (cubin object),

  • and modifying the host code by replacing the <<<...>>> syntax introduced in Kernels (and described in more detail in Execution Configuration) by the necessary CUDA runtime function calls to load and launch each compiled kernel from the PTX code and/or cubin object.

The modified host code is output either as C++ code that is left to be compiled using another tool or as object code directly by letting nvcc invoke the host compiler during the last compilation stage.

Applications can then:

  • Either link to the compiled host code (this is the most common case),

  • Or ignore the modified host code (if any) and use the CUDA driver API (see Driver API) to load and execute the PTX code or cubin object.

3.1.1.2. Just-in-Time Compilation

Any PTX code loaded by an application at runtime is compiled further to binary code by the device driver. This is called just-in-time compilation. Just-in-time compilation increases application load time, but allows the application to benefit from any new compiler improvements coming with each new device driver. It is also the only way for applications to run on devices that did not exist at the time the application was compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code for some application, it automatically caches a copy of the generated binary code in order to avoid repeating the compilation in subsequent invocations of the application. The cache - referred to as compute cache - is automatically invalidated when the device driver is upgraded, so that applications can benefit from the improvements in the new just-in-time compiler built into the device driver.

Environment variables are available to control just-in-time compilation as described in CUDA Environment Variables.

As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation library for CUDA C++; more information can be found in the NVRTC User guide.
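A minimal NVRTC sketch might compile a kernel string to PTX as follows. Error checking is omitted and the kernel source is a made-up example; the resulting PTX would then be loaded and launched through the driver API:

```cpp
// Minimal NVRTC sketch: compile a CUDA C++ kernel string to PTX at runtime.
// Error checking is omitted; the kernel source is an illustrative example.
#include <cstdio>
#include <string>
#include <nvrtc.h>

int main()
{
    const char* source =
        "__global__ void scale(float* v, float s, int n) {\n"
        "    int i = blockDim.x * blockIdx.x + threadIdx.x;\n"
        "    if (i < n) v[i] *= s;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "scale.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_80" };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    // The PTX can now be loaded with the driver API (e.g., cuModuleLoadData).
    std::printf("%s", ptx.c_str());
    return 0;
}
```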

3.1.2. Binary Compatibility

Binary code is architecture-specific. A cubin object is generated using the compiler option -code that specifies the targeted architecture: For example, compiling with -code=sm_80 produces binary code for devices of compute capability 8.0. Binary compatibility is guaranteed from one minor revision to the next one, but not from one minor revision to the previous one or across major revisions. In other words, a cubin object generated for compute capability X.y will only execute on devices of compute capability X.z where z≥y.

Note

Binary compatibility is supported only for the desktop. It is not supported for Tegra. Also, the binary compatibility between desktop and Tegra is not supported.

3.1.3. PTX Compatibility

Some PTX instructions are only supported on devices of higher compute capabilities. For example, Warp Shuffle Functions are only supported on devices of compute capability 5.0 and above. The -arch compiler option specifies the compute capability that is assumed when compiling C++ to PTX code. So, code that contains warp shuffle, for example, must be compiled with -arch=compute_50 (or higher).

PTX code produced for some specific compute capability can always be compiled to binary code of greater or equal compute capability. Note that a binary compiled from an earlier PTX version may not make use of some hardware features. For example, a binary targeting devices of compute capability 7.0 (Volta) compiled from PTX generated for compute capability 6.0 (Pascal) will not make use of Tensor Core instructions, since these were not available on Pascal. As a result, the final binary may perform worse than would be possible if the binary were generated using the latest version of PTX.

PTX code compiled to target architecture-conditional features runs only on the exact same physical architecture and nowhere else. Arch-conditional PTX code is not forward or backward compatible. For example, code compiled with sm_90a or compute_90a only runs on devices with compute capability 9.0 and is not backward or forward compatible.

3.1.4. Application Compatibility

To execute code on devices of specific compute capability, an application must load binary or PTX code that is compatible with this compute capability as described in Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on future architectures with higher compute capability (for which no binary code can be generated yet), an application must load PTX code that will be just-in-time compiled for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C++ application is controlled by the -arch and -code compiler options or the -gencode compiler option as detailed in the nvcc user manual. For example,

nvcc x.cu
        -gencode arch=compute_50,code=sm_50
        -gencode arch=compute_60,code=sm_60
        -gencode arch=compute_70,code=\"compute_70,sm_70\"

embeds binary code compatible with compute capability 5.0 and 6.0 (first and second -gencode options) and PTX and binary code compatible with compute capability 7.0 (third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code to load and execute, which, in the above example, will be:

  • 5.0 binary code for devices with compute capability 5.0 and 5.2,

  • 6.0 binary code for devices with compute capability 6.0 and 6.1,

  • 7.0 binary code for devices with compute capability 7.0 and 7.5,

  • PTX code which is compiled to binary code at runtime for devices with compute capability 8.0 and 8.6.

x.cu can have an optimized code path that uses warp reduction operations, for example, which are only supported in devices of compute capability 8.0 and higher. The __CUDA_ARCH__ macro can be used to differentiate various code paths based on compute capability. It is only defined for device code. When compiling with -arch=compute_80 for example, __CUDA_ARCH__ is equal to 800.
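A sketch of such a differentiated code path follows; the reduction kernel below is illustrative, not part of x.cu. On compute capability 8.0 and higher it uses the hardware warp-reduction intrinsic, and otherwise falls back to a shuffle-based reduction:

```cpp
// Device-code sketch: select a code path based on __CUDA_ARCH__.
// __CUDA_ARCH__ is only defined when compiling device code.
__global__ void sumKernel(const int* in, int* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // Compute capability 8.0+: hardware warp-reduction intrinsic.
    v = __reduce_add_sync(0xffffffff, v);
#else
    // Older architectures: shuffle-based warp reduction.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
#endif
    // Lane 0 of each warp contributes the warp's partial sum.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}
```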

If x.cu is compiled for architecture-conditional features, for example with sm_90a or compute_90a, the code can only run on devices with compute capability 9.0.

Applications using the driver API must compile code to separate files and explicitly load and execute the most appropriate file at runtime.

The Volta architecture introduces Independent Thread Scheduling which changes the way threads are scheduled on the GPU. For code relying on specific behavior of SIMT scheduling in previous architectures, Independent Thread Scheduling may alter the set of participating threads, leading to incorrect results. To aid migration while implementing the corrective actions detailed in Independent Thread Scheduling, Volta developers can opt-in to Pascal’s thread scheduling with the compiler option combination -arch=compute_60 -code=sm_70.

The nvcc user manual lists various shorthands for the -arch, -code, and -gencode compiler options. For example, -arch=sm_70 is a shorthand for -arch=compute_70 -code=compute_70,sm_70 (which is the same as -gencode arch=compute_70,code=\"compute_70,sm_70\").

3.1.5. C++ Compatibility

The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in C++ Language Support.

3.1.6. 64-Bit Compatibility

The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-bit). Device code compiled in 64-bit mode is only supported with host code compiled in 64-bit mode.

3.2. CUDA Runtime

The runtime is implemented in the cudart library, which is linked to the application, either statically via cudart.lib or libcudart.a, or dynamically via cudart.dll or libcudart.so. Applications that require cudart.dll and/or cudart.so for dynamic linking typically include them as part of the application installation package. It is only safe to pass the address of CUDA runtime symbols between components that link to the same instance of the CUDA runtime.

All runtime entry points are prefixed with cuda.

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Device Memory gives an overview of the runtime functions used to manage device memory.

Shared Memory illustrates the use of shared memory, introduced in Thread Hierarchy, to maximize performance.

Page-Locked Host Memory introduces page-locked host memory that is required to overlap kernel execution with data transfers between host and device memory.

Asynchronous Concurrent Execution describes the concepts and API used to enable asynchronous concurrent execution at various levels in the system.

Multi-Device System shows how the programming model extends to a system with multiple devices attached to the same host.

Error Checking describes how to properly check the errors generated by the runtime.
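A common pattern, anticipating the full discussion in Error Checking, is to wrap every runtime call in a checking macro. The following sketch is illustrative (the macro name CHECK_CUDA is arbitrary):

```cpp
// Sketch of a runtime error-checking macro; the name CHECK_CUDA is arbitrary.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            std::fprintf(stderr, "CUDA error %s at %s:%d: %s\n",      \
                         cudaGetErrorName(err), __FILE__, __LINE__,   \
                         cudaGetErrorString(err));                    \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Usage:
//     CHECK_CUDA(cudaMalloc(&d_ptr, size));
```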

Call Stack mentions the runtime functions used to manage the CUDA C++ call stack.

Texture and Surface Memory presents the texture and surface memory spaces that provide another way to access device memory; they also expose a subset of the GPU texturing hardware.

Graphics Interoperability introduces the various functions the runtime provides to interoperate with the two main graphics APIs, OpenGL and Direct3D.

3.2.1. Initialization

As of CUDA 12.0, the cudaInitDevice() and cudaSetDevice() calls initialize the runtime and the primary context associated with the specified device. Absent these calls, the runtime will implicitly use device 0 and self-initialize as needed to process other runtime API requests. One needs to keep this in mind when timing runtime function calls and when interpreting the error code from the first call into the runtime. Before 12.0, cudaSetDevice() would not initialize the runtime and applications would often use the no-op runtime call cudaFree(0) to isolate the runtime initialization from other api activity (both for the sake of timing and error handling).

The runtime creates a CUDA context for each device in the system (see Context for more details on CUDA contexts). This context is the primary context for this device and is initialized at the first runtime function which requires an active context on this device. It is shared among all the host threads of the application. As part of this context creation, the device code is just-in-time compiled if necessary (see Just-in-Time Compilation) and loaded into device memory. This all happens transparently. If needed, for example, for driver API interoperability, the primary context of a device can be accessed from the driver API as described in Interoperability between Runtime and Driver APIs.

When a host thread calls cudaDeviceReset(), this destroys the primary context of the device the host thread currently operates on (i.e., the current device as defined in Device Selection). The next runtime function call made by any host thread that has this device as current will create a new primary context for this device.

Note

The CUDA interfaces use global state that is initialized during host program initiation and destroyed during host program termination. The CUDA runtime and driver cannot detect if this state is invalid, so using any of these interfaces (implicitly or explicitly) during program initiation or termination (after main) will result in undefined behavior.

As of CUDA 12.0, cudaSetDevice() will now explicitly initialize the runtime after changing the current device for the host thread. Previous versions of CUDA delayed runtime initialization on the new device until the first runtime call was made after cudaSetDevice(). This change means that it is now very important to check the return value of cudaSetDevice() for initialization errors.
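For example, a hedged sketch of selecting a device while checking for initialization errors (the function name selectDevice is illustrative):

```cpp
// Sketch: select a device and check for initialization errors.
// Since CUDA 12.0, cudaSetDevice() initializes the runtime itself,
// so its return value must be checked.
#include <cstdio>
#include <cuda_runtime.h>

cudaError_t selectDevice(int device)
{
    cudaError_t err = cudaSetDevice(device);
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaSetDevice(%d) failed: %s\n",
                     device, cudaGetErrorString(err));
    return err;
}
```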

The runtime functions from the error handling and version management sections of the reference manual do not initialize the runtime.

3.2.2. Device Memory

As mentioned in Heterogeneous Programming, the CUDA programming model assumes a system composed of a host and a device, each with their own separate memory. Kernels operate out of device memory, so the runtime provides functions to allocate, deallocate, and copy device memory, as well as transfer data between host memory and device memory.

Device memory can be allocated either as linear memory or as CUDA arrays.

CUDA arrays are opaque memory layouts optimized for texture fetching. They are described in Texture and Surface Memory.

Linear memory is allocated in a single unified address space, which means that separately allocated entities can reference one another via pointers, for example, in a binary tree or linked list. The size of the address space depends on the host system (CPU) and the compute capability of the used GPU:

Table 1 Linear Memory Address Space

                                             x86_64 (AMD64)   POWER (ppc64le)   ARM64
  up to compute capability 5.3 (Maxwell)     40 bit           40 bit            40 bit
  compute capability 6.0 (Pascal) or newer   up to 47 bit     up to 49 bit      up to 48 bit

Note

On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates an uncommitted 40bit virtual address reservation to ensure that memory allocations (pointers) fall into the supported range. This reservation appears as reserved virtual memory, but does not occupy any physical memory until the program actually allocates memory.

Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(), and data transfer between host memory and device memory is typically done using cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be copied from host memory to device memory:

// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements described in Device Memory Accesses, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

The following code sample allocates a width x height x depth 3D array of floating-point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}

Note

To avoid allocating too much memory and thus impacting system-wide performance, request the allocation parameters from the user based on the problem size. If the allocation fails, you can fall back to other slower memory types (cudaMallocHost(), cudaHostRegister(), etc.), or return an error telling the user how much memory was needed but denied. If your application cannot request the allocation parameters for some reason, we recommend using cudaMallocManaged() for platforms that support it.
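One possible shape of such a fallback, as a hedged sketch (the function name allocateBuffer is illustrative, and the fallback target here is managed memory rather than the other types mentioned above):

```cpp
// Sketch: try ordinary device memory first, then fall back to managed memory.
// The function name allocateBuffer is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

cudaError_t allocateBuffer(void** ptr, size_t bytes)
{
    cudaError_t err = cudaMalloc(ptr, bytes);
    if (err == cudaErrorMemoryAllocation) {
        cudaGetLastError();  // clear the error state before retrying
        std::fprintf(stderr,
                     "cudaMalloc of %zu bytes failed; trying managed memory\n",
                     bytes);
        err = cudaMallocManaged(ptr, bytes);
    }
    return err;
}
```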

The reference manual lists all the various functions used to copy memory between linear memory allocated with cudaMalloc(), linear memory allocated with cudaMallocPitch() or cudaMalloc3D(), CUDA arrays, and memory allocated for variables declared in global or constant memory space.

The following code sample illustrates various ways of accessing global variables via the runtime API:

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

cudaGetSymbolAddress() is used to retrieve the address pointing to the memory allocated for a variable declared in global memory space. The size of the allocated memory is obtained through cudaGetSymbolSize().
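Continuing the example above, a sketch of retrieving the device address and size of constData:

```cpp
// Sketch: retrieve the device address and size of a __constant__ variable.
#include <cuda_runtime.h>

__constant__ float constData[256];

void querySymbol()
{
    void* addr = nullptr;
    size_t size = 0;
    cudaGetSymbolAddress(&addr, constData);  // device pointer to constData
    cudaGetSymbolSize(&size, constData);     // size == 256 * sizeof(float)
}
```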

3.2.3. Device Memory L2 Access Management

When a CUDA kernel accesses a data region in the global memory repeatedly, such data accesses can be considered to be persisting. On the other hand, if the data is only accessed once, such data accesses can be considered to be streaming.

Starting with CUDA 11.0, devices of compute capability 8.0 and above have the capability to influence persistence of data in the L2 cache, potentially providing higher bandwidth and lower latency accesses to global memory.

3.2.3.1. L2 cache Set-Aside for Persisting Accesses

A portion of the L2 cache can be set aside to be used for persisting data accesses to global memory. Persisting accesses have prioritized use of this set-aside portion of the L2 cache, whereas normal or streaming accesses to global memory can only utilize this portion of L2 when it is unused by persisting accesses.

The L2 cache set-aside size for persisting accesses may be adjusted, within limits:

cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed*/

When the GPU is configured in Multi-Instance GPU (MIG) mode, the L2 cache set-aside functionality is disabled.

When using the Multi-Process Service (MPS), the L2 cache set-aside size cannot be changed by cudaDeviceSetLimit. Instead, the set-aside size can only be specified at start up of MPS server through the environment variable CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT.

3.2.3.2. L2 Policy for Persisting Accesses

An access policy window specifies a contiguous region of global memory and a persistence property in the L2 cache for accesses within that region.

The code example below shows how to set an L2 persisting access window using a CUDA Stream.

CUDA Stream Example

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);

When a kernel subsequently executes in the CUDA stream, memory accesses within the global memory extent [ptr..ptr+num_bytes) are more likely to persist in the L2 cache than accesses to other global memory locations.

L2 persistence can also be set for a CUDA Graph Kernel Node as shown in the example below:

CUDA GraphKernelNode Example

cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                            // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

//Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);

The hitRatio parameter can be used to specify the fraction of accesses that receive the hitProp property. In both of the examples above, 60% of the memory accesses in the global memory region [ptr..ptr+num_bytes) have the persisting property and 40% of the memory accesses have the streaming property. Which specific memory accesses are classified as persisting (the hitProp) is random with a probability of approximately hitRatio; the probability distribution depends upon the hardware architecture and the memory extent.

For example, if the L2 set-aside cache size is 16KB and the num_bytes in the accessPolicyWindow is 32KB:

  • With a hitRatio of 0.5, the hardware will select, at random, 16KB of the 32KB window to be designated as persisting and cached in the set-aside L2 cache area.

  • With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window in the set-aside L2 cache area. Since the set-aside area is smaller than the window, cache lines will be evicted to keep the most recently used 16KB of the 32KB data in the set-aside portion of the L2 cache.

The hitRatio can therefore be used to avoid thrashing of cache lines and overall reduce the amount of data moved into and out of the L2 cache.

A hitRatio value below 1.0 can be used to manually control the amount of data different accessPolicyWindows from concurrent CUDA streams can cache in L2. For example, let the L2 set-aside cache size be 16KB; two concurrent kernels in two different CUDA streams, each with a 16KB accessPolicyWindow, and both with hitRatio value 1.0, might evict each others’ cache lines when competing for the shared L2 resource. However, if both accessPolicyWindows have a hitRatio value of 0.5, they will be less likely to evict their own or each others’ persisting cache lines.

3.2.3.3. L2 Access Properties

Three types of access properties are defined for different global memory data accesses:

  1. cudaAccessPropertyStreaming: Memory accesses that occur with the streaming property are less likely to persist in the L2 cache because these accesses are preferentially evicted.

  2. cudaAccessPropertyPersisting: Memory accesses that occur with the persisting property are more likely to persist in the L2 cache because these accesses are preferentially retained in the set-aside portion of L2 cache.

  3. cudaAccessPropertyNormal: This access property forcibly resets previously applied persisting access property to a normal status. Memory accesses with the persisting property from previous CUDA kernels may be retained in L2 cache long after their intended use. This persistence-after-use reduces the amount of L2 cache available to subsequent kernels that do not use the persisting property. Resetting an access property window with the cudaAccessPropertyNormal property removes the persisting (preferential retention) status of the prior access, as if the prior access had been without an access property.

3.2.3.4. L2 Persistence Example

The following example shows how to set aside L2 cache for persisting accesses, use the set-aside L2 cache in CUDA kernels via a CUDA stream, and then reset the L2 cache.

cudaStream_t stream;
cudaStreamCreate(&stream);                                                                  // Create CUDA stream

cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties( &prop, device_id);                                                 // Query GPU properties
size_t size = min( int(prop.l2CacheSize * 0.75) , prop.persistingL2CacheMaxSize );
cudaDeviceSetLimit( cudaLimitPersistingL2CacheSize, size);                                  // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);                        // Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);               // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

for(int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size,block_size,0,stream>>>(data1);                                 // This data1 is used by a kernel multiple times
}                                                                                           // [data1 + num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size,block_size,0,stream>>>(data1);                                     // A different kernel in the same stream can also benefit
                                                                                            // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disables it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute to a CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2

cuda_kernelC<<<grid_size,block_size,0,stream>>>(data2);                                     // data2 can now benefit from full L2 in normal mode

3.2.3.5. Reset L2 Access to Normal

A persisting L2 cache line from a previous CUDA kernel may remain in L2 long after it has been used. Hence, resetting the L2 cache to normal is important so that streaming or normal memory accesses can utilize the L2 cache with normal priority. There are three ways a persisting access can be reset to normal status.

  1. Reset a previous persisting memory region with the access property, cudaAccessPropertyNormal.

  2. Reset all persisting L2 cache lines to normal by calling cudaCtxResetPersistingL2Cache().

  3. Eventually untouched lines are automatically reset to normal. Reliance on automatic reset is strongly discouraged because of the undetermined length of time required for automatic reset to occur.

3.2.3.6. Manage Utilization of L2 set-aside cache

Multiple CUDA kernels executing concurrently in different CUDA streams may have a different access policy window assigned to their streams. However, the L2 set-aside cache portion is shared among all these concurrent CUDA kernels. As a result, the net utilization of this set-aside cache portion is the sum of all the concurrent kernels’ individual use. The benefits of designating memory accesses as persisting diminish as the volume of persisting accesses exceeds the set-aside L2 cache capacity.

To manage utilization of the set-aside L2 cache portion, an application must consider the following:

  • Size of L2 set-aside cache.

  • CUDA kernels that may concurrently execute.

  • The access policy window for all the CUDA kernels that may concurrently execute.

  • When and how L2 reset is required to allow normal or streaming accesses to utilize the previously set-aside L2 cache with equal priority.

3.2.3.7. Query L2 cache Properties

Properties related to L2 cache are a part of the cudaDeviceProp struct and can be queried using the CUDA runtime API cudaGetDeviceProperties.

CUDA Device Properties include:

  • l2CacheSize: The amount of available L2 cache on the GPU.

  • persistingL2CacheMaxSize: The maximum amount of L2 cache that can be set-aside for persisting memory accesses.

  • accessPolicyMaxWindowSize: The maximum size of the access policy window.

3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access

The L2 set-aside cache size for persisting memory accesses is queried using CUDA runtime API cudaDeviceGetLimit and set using CUDA runtime API cudaDeviceSetLimit as a cudaLimit. The maximum value for setting this limit is cudaDeviceProp::persistingL2CacheMaxSize.

enum cudaLimit {
    /* other fields not shown */
    cudaLimitPersistingL2CacheSize
};

3.2.4. Shared Memory

As detailed in Variable Memory Space Specifiers, shared memory is allocated using the __shared__ memory space specifier.

Shared memory is expected to be much faster than global memory as mentioned in Thread Hierarchy and detailed in Shared Memory. It can be used as scratchpad memory (or software managed cache) to minimize global memory accesses from a CUDA block as illustrated by the following matrix multiplication example.

The following code sample is a straightforward implementation of matrix multiplication that does not take advantage of shared memory. Each thread reads one row of A and one column of B and computes the corresponding element of C as illustrated in Figure 8. A is therefore read B.width times from global memory and B is read A.height times.

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);

    // Allocate C in device memory
    Matrix d_C;
    d_C.width = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);

    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);

    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Each thread computes one element of C
    // by accumulating results into Cvalue
    float Cvalue = 0;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    for (int e = 0; e < A.width; ++e)
        Cvalue += A.elements[row * A.width + e]
                * B.elements[e * B.width + col];
    C.elements[row * C.width + col] = Cvalue;
}

Figure 8 Matrix Multiplication without Shared Memory

The following code sample is an implementation of matrix multiplication that does take advantage of shared memory. In this implementation, each thread block is responsible for computing one square sub-matrix Csub of C and each thread within the block is responsible for computing one element of Csub. As illustrated in Figure 9, Csub is equal to the product of two rectangular matrices: the sub-matrix of A of dimension (A.width, block_size) that has the same row indices as Csub, and the sub-matrix of B of dimension (block_size, A.width) that has the same column indices as Csub. In order to fit into the device’s resources, these two rectangular matrices are divided into as many square matrices of dimension block_size as necessary and Csub is computed as the sum of the products of these square matrices. Each of these products is performed by first loading the two corresponding square matrices from global memory to shared memory with one thread loading one element of each matrix, and then by having each thread compute one element of the product. Each thread accumulates the result of each of these products into a register and once done writes the result to global memory.

By blocking the computation this way, we take advantage of fast shared memory and save a lot of global memory bandwidth since A is only read (B.width / block_size) times from global memory and B is read (A.height / block_size) times.

The Matrix type from the previous code sample is augmented with a stride field, so that sub-matrices can be efficiently represented with the same type. __device__ functions are used to get and set elements and build any sub-matrix from a matrix.

// Matrices are stored in row-major order:
// M(row, col) = *(M.elements + row * M.stride + col)
typedef struct {
    int width;
    int height;
    int stride;
    float* elements;
} Matrix;
// Get a matrix element
__device__ float GetElement(const Matrix A, int row, int col)
{
    return A.elements[row * A.stride + col];
}
// Set a matrix element
__device__ void SetElement(Matrix A, int row, int col,
                           float value)
{
    A.elements[row * A.stride + col] = value;
}
// Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub of A that is
// located col sub-matrices to the right and row sub-matrices down
// from the upper-left corner of A
__device__ Matrix GetSubMatrix(Matrix A, int row, int col)
{
    Matrix Asub;
    Asub.width    = BLOCK_SIZE;
    Asub.height   = BLOCK_SIZE;
    Asub.stride   = A.stride;
    Asub.elements = &A.elements[A.stride * BLOCK_SIZE * row
                                         + BLOCK_SIZE * col];
    return Asub;
}
// Thread block size
#define BLOCK_SIZE 16
// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);
// Matrix multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE
void MatMul(const Matrix A, const Matrix B, Matrix C)
{
    // Load A and B to device memory
    Matrix d_A;
    d_A.width = d_A.stride = A.width; d_A.height = A.height;
    size_t size = A.width * A.height * sizeof(float);
    cudaMalloc(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size,
               cudaMemcpyHostToDevice);
    Matrix d_B;
    d_B.width = d_B.stride = B.width; d_B.height = B.height;
    size = B.width * B.height * sizeof(float);
    cudaMalloc(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size,
               cudaMemcpyHostToDevice);
    // Allocate C in device memory
    Matrix d_C;
    d_C.width = d_C.stride = C.width; d_C.height = C.height;
    size = C.width * C.height * sizeof(float);
    cudaMalloc(&d_C.elements, size);
    // Invoke kernel
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(B.width / dimBlock.x, A.height / dimBlock.y);
    MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);
    // Read C from device memory
    cudaMemcpy(C.elements, d_C.elements, size,
               cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
    // Block row and column
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix Csub of C
    Matrix Csub = GetSubMatrix(C, blockRow, blockCol);
    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;
    // Thread row and column within Csub
    int row = threadIdx.y;
    int col = threadIdx.x;
    // Loop over all the sub-matrices of A and B that are
    // required to compute Csub
    // Multiply each pair of sub-matrices together
    // and accumulate the results
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);
        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);
        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);
        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();
        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write Csub to device memory
    // Each thread writes one element
    SetElement(Csub, row, col, Cvalue);
}

Figure 9 Matrix Multiplication with Shared Memory

3.2.5. Distributed Shared Memory

Thread block clusters, introduced in compute capability 9.0, provide the ability for threads in a cluster to access the shared memory of all the participating thread blocks. This partitioned shared memory is called distributed shared memory, and the corresponding address space is called the distributed shared memory address space. Threads that belong to a thread block cluster can read, write, or perform atomics in the distributed address space, regardless of whether the address belongs to the local thread block or a remote thread block. Whether or not a kernel uses distributed shared memory, the shared memory size specifications, static or dynamic, are still per thread block. The size of distributed shared memory is simply the number of thread blocks per cluster multiplied by the size of shared memory per thread block.

Accessing data in distributed shared memory requires all the thread blocks to exist. A user can guarantee that all thread blocks have started executing using cluster.sync() from the Cluster Group API. The user also needs to ensure that all distributed shared memory operations happen before the exit of a thread block; for example, if a remote thread block is trying to read a given thread block’s shared memory, the user needs to ensure that the shared memory read by the remote thread block is completed before that block can exit.

CUDA provides a mechanism to access distributed shared memory, and applications can benefit from leveraging its capabilities. Let's look at a simple histogram computation and how to optimize it on the GPU using thread block clusters. A standard way of computing histograms is to do the computation in the shared memory of each thread block and then perform global memory atomics. A limitation of this approach is the shared memory capacity. Once the histogram bins no longer fit in shared memory, a user needs to compute the histogram, and hence the atomics, directly in global memory. With distributed shared memory, CUDA provides an intermediate step: depending on the histogram bin count, the histogram can be computed in shared memory, distributed shared memory, or global memory directly.

The CUDA kernel example below shows how to compute histograms in shared memory or distributed shared memory, depending on the number of histogram bins.

#include <cooperative_groups.h>

// Distributed Shared memory histogram kernel
__global__ void clusterHist_kernel(int *bins, const int nbins, const int bins_per_block, const int *__restrict__ input,
                                   size_t array_size)
{
  extern __shared__ int smem[];
  namespace cg = cooperative_groups;
  int tid = cg::this_grid().thread_rank();

  // Cluster initialization, size and calculating local bin offsets.
  cg::cluster_group cluster = cg::this_cluster();
  unsigned int clusterBlockRank = cluster.block_rank();
  int cluster_size = cluster.dim_blocks().x;

  for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
  {
    smem[i] = 0; //Initialize shared memory histogram to zeros
  }

  // cluster synchronization ensures that shared memory is initialized to zero in
  // all thread blocks in the cluster. It also ensures that all thread blocks
  // have started executing and they exist concurrently.
  cluster.sync();

  for (int i = tid; i < array_size; i += blockDim.x * gridDim.x)
  {
    int ldata = input[i];

    //Find the right histogram bin.
    int binid = ldata;
    if (ldata < 0)
      binid = 0;
    else if (ldata >= nbins)
      binid = nbins - 1;

    //Find destination block rank and offset for computing
    //distributed shared memory histogram
    int dst_block_rank = (int)(binid / bins_per_block);
    int dst_offset = binid % bins_per_block;

    //Pointer to target block shared memory
    int *dst_smem = cluster.map_shared_rank(smem, dst_block_rank);

    //Perform atomic update of the histogram bin
    atomicAdd(dst_smem + dst_offset, 1);
  }

  // cluster synchronization is required to ensure all distributed shared
  // memory operations are completed and no thread block exits while
  // other thread blocks are still accessing distributed shared memory
  cluster.sync();

  // Perform global memory histogram, using the local distributed memory histogram
  int *lbins = bins + cluster.block_rank() * bins_per_block;
  for (int i = threadIdx.x; i < bins_per_block; i += blockDim.x)
  {
    atomicAdd(&lbins[i], smem[i]);
  }
}

The above kernel can be launched at runtime with a cluster size that depends on the amount of distributed shared memory required. If the histogram is small enough to fit in the shared memory of just one block, the user can launch the kernel with cluster size 1. The code snippet below shows how to launch a cluster kernel dynamically depending on shared memory requirements.

// Launch via extensible launch
{
  cudaLaunchConfig_t config = {0};
  config.gridDim = array_size / threads_per_block;
  config.blockDim = threads_per_block;

  // cluster_size depends on the histogram size.
  // ( cluster_size == 1 ) implies no distributed shared memory, just thread block local shared memory
  int cluster_size = 2; // size 2 is an example here
  int nbins_per_block = nbins / cluster_size;

  //dynamic shared memory size is per block.
  //Distributed shared memory size =  cluster_size * nbins_per_block * sizeof(int)
  config.dynamicSmemBytes = nbins_per_block * sizeof(int);

  CUDA_CHECK(::cudaFuncSetAttribute((void *)clusterHist_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, config.dynamicSmemBytes));

  cudaLaunchAttribute attribute[1];
  attribute[0].id = cudaLaunchAttributeClusterDimension;
  attribute[0].val.clusterDim.x = cluster_size;
  attribute[0].val.clusterDim.y = 1;
  attribute[0].val.clusterDim.z = 1;

  config.numAttrs = 1;
  config.attrs = attribute;

  cudaLaunchKernelEx(&config, clusterHist_kernel, bins, nbins, nbins_per_block, input, array_size);
}

3.2.6. Page-Locked Host Memory

The runtime provides functions to allow the use of page-locked (also known as pinned) host memory (as opposed to regular pageable host memory allocated by malloc()):

  • cudaHostAlloc() and cudaFreeHost() allocate and free page-locked host memory;

  • cudaHostRegister() page-locks a range of memory allocated by malloc() (see reference manual for limitations).

Using page-locked host memory has several benefits:

  • Copies between page-locked host memory and device memory can be performed concurrently with kernel execution for some devices as mentioned in Asynchronous Concurrent Execution.

  • On some devices, page-locked host memory can be mapped into the address space of the device, eliminating the need to copy it to or from device memory as detailed in Mapped Memory.

  • On systems with a front-side bus, bandwidth between host memory and device memory is higher if host memory is allocated as page-locked and even higher if in addition it is allocated as write-combining as described in Write-Combining Memory.

Note

Page-locked host memory is not cached on non I/O coherent Tegra devices. Also, cudaHostRegister() is not supported on non I/O coherent Tegra devices.

The simple zero-copy CUDA sample comes with a detailed document on the page-locked memory APIs.

3.2.6.1. Portable Memory

A block of page-locked memory can be used in conjunction with any device in the system (see Multi-Device System for more details on multi-device systems), but by default, the benefits of using page-locked memory described above are only available in conjunction with the device that was current when the block was allocated (and with all devices sharing the same unified address space, if any, as described in Unified Virtual Address Space). To make these advantages available to all devices, the block needs to be allocated by passing the flag cudaHostAllocPortable to cudaHostAlloc() or page-locked by passing the flag cudaHostRegisterPortable to cudaHostRegister().
页面锁定内存块可以与系统中的任何设备一起使用(有关多设备系统的更多详细信息,请参见多设备系统),但默认情况下,上述使用页面锁定内存的好处仅在与分配该块时的当前设备(以及共享同一统一地址空间的所有设备,如果有的话,如统一虚拟地址空间中所述)一起使用时才可用。要使所有设备都能获得这些好处,需要通过将标志 cudaHostAllocPortable 传递给 cudaHostAlloc() 来分配该块,或通过将标志 cudaHostRegisterPortable 传递给 cudaHostRegister() 来对其进行页面锁定。
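A minimal sketch of a portable allocation follows (not from the original guide; the device buffers, streams, and sizes are illustrative, and error checking is omitted):
下面是可移植分配的一个最小示意(并非出自原指南;其中的设备缓冲区、流和大小仅为示意,且省略了错误检查):

```cpp
#include <cuda_runtime.h>

// Sketch: a page-locked buffer whose pinning benefits apply to every device.
// d_buf0/d_buf1 and the streams are assumed to belong to devices 0 and 1.
void broadcastToTwoDevices(void* d_buf0, void* d_buf1,
                           cudaStream_t stream0, cudaStream_t stream1) {
    const size_t bytes = 1 << 20;   // illustrative size
    float* h_buf = nullptr;
    // Without cudaHostAllocPortable, only the device current at allocation
    // time would treat h_buf as page-locked.
    cudaHostAlloc(&h_buf, bytes, cudaHostAllocPortable);

    cudaSetDevice(0);
    cudaMemcpyAsync(d_buf0, h_buf, bytes, cudaMemcpyHostToDevice, stream0);
    cudaSetDevice(1);
    cudaMemcpyAsync(d_buf1, h_buf, bytes, cudaMemcpyHostToDevice, stream1);

    cudaStreamSynchronize(stream0);  // wait for both copies before freeing
    cudaStreamSynchronize(stream1);
    cudaFreeHost(h_buf);
}
```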

3.2.6.2. Write-Combining Memory
3.2.6.2. 写结合内存 

By default page-locked host memory is allocated as cacheable. It can optionally be allocated as write-combining instead by passing flag cudaHostAllocWriteCombined to cudaHostAlloc(). Write-combining memory frees up the host’s L1 and L2 cache resources, making more cache available to the rest of the application. In addition, write-combining memory is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%.
默认情况下,页面锁定的主机内存被分配为可缓存的。通过将标志 cudaHostAllocWriteCombined 传递给 cudaHostAlloc() ,可以选择将其分配为写组合内存。写组合内存释放主机的 L1 和 L2 缓存资源,使更多缓存可用于应用程序的其余部分。此外,写组合内存在通过 PCI Express 总线传输时不会被侦听,这可以将传输性能提高多达 40%。

Reading from write-combining memory from the host is prohibitively slow, so write-combining memory should in general be used for memory that the host only writes to.
从主机读取写组合内存的速度极慢,因此写组合内存一般应仅用于主机只写入的内存。

Using CPU atomic instructions on WC memory should be avoided because not all CPU implementations guarantee that functionality.
在 WC 内存上使用 CPU 原子指令应该被避免,因为并非所有 CPU 实现都保证该功能。
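A write-combining staging buffer might be used as sketched below (hypothetical helper, error checking omitted) — note the host only ever writes to it:
写组合暂存缓冲区的用法可示意如下(假设性的辅助函数,省略错误检查)。注意主机只对其进行写入:

```cpp
#include <cstring>
#include <cuda_runtime.h>

// Sketch: upload data through a write-combining staging buffer.
void stageAndUpload(void* d_dst, const float* src, size_t n, cudaStream_t s) {
    float* h_wc = nullptr;
    cudaHostAlloc(&h_wc, n * sizeof(float), cudaHostAllocWriteCombined);
    // Host writes only; never read h_wc back (reads from WC memory are slow).
    memcpy(h_wc, src, n * sizeof(float));
    cudaMemcpyAsync(d_dst, h_wc, n * sizeof(float), cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);  // copy must finish before freeing the buffer
    cudaFreeHost(h_wc);
}
```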

3.2.6.3. Mapped Memory
3.2.6.3. 映射内存 

A block of page-locked host memory can also be mapped into the address space of the device by passing flag cudaHostAllocMapped to cudaHostAlloc() or by passing flag cudaHostRegisterMapped to cudaHostRegister(). Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space.
一个页面锁定的主机内存块也可以通过将标志 cudaHostAllocMapped 传递给 cudaHostAlloc() ,或将标志 cudaHostRegisterMapped 传递给 cudaHostRegister() ,映射到设备的地址空间中。因此,这样的块通常有两个地址:一个在主机内存中,由 cudaHostAlloc()malloc() 返回;另一个在设备内存中,可以使用 cudaHostGetDevicePointer() 检索,然后用于从内核中访问该块。唯一的例外是使用 cudaHostAlloc() 分配的指针,并且主机和设备使用统一地址空间(如统一虚拟地址空间中所述)的情况。

Accessing host memory directly from within a kernel does not provide the same bandwidth as device memory, but does have some advantages:
直接从内核中访问主机内存并不像访问设备内存那样提供相同的带宽,但确实具有一些优势:

  • There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel;
    不需要在设备内存中分配块并在该块与主机内存中的块之间复制数据;数据传输会根据内核的需要隐式执行;

  • There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution.
    不需要使用流(请参见并发数据传输)来重叠数据传输和内核执行;内核发起的数据传输会自动与内核执行重叠。

Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events (see Asynchronous Concurrent Execution) to avoid any potential read-after-write, write-after-read, or write-after-write hazards.
由于映射的页面锁定内存在主机和设备之间共享,因此应用程序必须使用流或事件(请参阅异步并发执行)同步内存访问,以避免任何潜在的读后写、写后读或写后写危险。

To be able to retrieve the device pointer to any mapped page-locked memory, page-locked memory mapping must be enabled by calling cudaSetDeviceFlags() with the cudaDeviceMapHost flag before any other CUDA call is performed. Otherwise, cudaHostGetDevicePointer() will return an error.
要能够检索到任何映射的页锁定内存的设备指针,必须在执行任何其他 CUDA 调用之前,通过使用 cudaDeviceMapHost 标志调用 cudaSetDeviceFlags() 来启用页锁定内存映射。否则, cudaHostGetDevicePointer() 将返回错误。

cudaHostGetDevicePointer() also returns an error if the device does not support mapped page-locked host memory. Applications may query this capability by checking the canMapHostMemory device property (see Device Enumeration), which is equal to 1 for devices that support mapped page-locked host memory.
cudaHostGetDevicePointer() 如果设备不支持映射的页面锁定主机内存,也会返回错误。应用程序可以通过检查 canMapHostMemory 设备属性(请参阅设备枚举)来查询此功能,对于支持映射的页面锁定主机内存的设备,该属性值为 1。
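Putting these pieces together, a minimal end-to-end sketch might look as follows (not from the original guide; the kernel and sizes are illustrative, and error checking is omitted):
将上述各部分组合起来,一个最小的端到端示意可能如下(并非出自原指南;内核和大小仅为示意,省略错误检查):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel that accesses mapped host memory through d_ptr.
__global__ void scale(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    // Must be set before any other CUDA call creates a context.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) { printf("mapped memory unsupported\n"); return 0; }

    const int n = 1024;
    float *h_ptr, *d_ptr;
    cudaHostAlloc(&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);
    for (int i = 0; i < n; ++i) h_ptr[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(d_ptr, n);  // kernel reads/writes host memory
    cudaDeviceSynchronize();                    // synchronize before the host reads
    printf("h_ptr[0] = %f\n", h_ptr[0]);
    cudaFreeHost(h_ptr);
    return 0;
}
```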

Note that atomic functions (see Atomic Functions) operating on mapped page-locked memory are not atomic from the point of view of the host or other devices.
请注意,对映射的页面锁定内存执行的原子函数(请参阅原子函数)在主机或其他设备的角度来看并不是原子的。

Also note that CUDA runtime requires that 1-byte, 2-byte, 4-byte, and 8-byte naturally aligned loads and stores to host memory initiated from the device are preserved as single accesses from the point of view of the host and other devices. On some platforms, atomics to memory may be broken by the hardware into separate load and store operations. These component load and store operations have the same requirements on preservation of naturally aligned accesses. As an example, the CUDA runtime does not support a PCI Express bus topology where a PCI Express bridge splits 8-byte naturally aligned writes into two 4-byte writes between the device and the host.
请注意,CUDA 运行时要求从设备发起的对主机内存的 1 字节、2 字节、4 字节和 8 字节自然对齐的加载和存储在主机和其他设备的视角下被保留为单个访问。在某些平台上,对内存的原子操作可能会被硬件分解为单独的加载和存储操作。这些组件加载和存储操作对自然对齐访问的保留具有相同的要求。例如,CUDA 运行时不支持 PCI Express 总线拓扑结构,其中 PCI Express 桥将 8 字节自然对齐写入拆分为设备和主机之间的两个 4 字节写入。

3.2.7. Memory Synchronization Domains
3.2.7. 内存同步域 

3.2.7.1. Memory Fence Interference
3.2.7.1. 内存栅栏干扰 

Some CUDA applications may see degraded performance due to memory fence/flush operations waiting on more transactions than those necessitated by the CUDA memory consistency model.
一些 CUDA 应用程序可能会因为内存栅栏/刷新操作等待的事务比 CUDA 内存一致性模型所需的事务更多而导致性能下降。

__managed__ int x = 0;
__device__  cuda::atomic<int, cuda::thread_scope_device> a(0);
__managed__ cuda::atomic<int, cuda::thread_scope_system> b(0);

Thread 1 (SM) 线程 1(SM)

x = 1;
a = 1;

Thread 2 (SM) 线程 2(SM)

while (a != 1) ;
assert(x == 1);
b = 1;

Thread 3 (CPU) 线程 3(CPU)

while (b != 1) ;
assert(x == 1);

Consider the example above. The CUDA memory consistency model guarantees that the asserted condition will be true, so the write to x from thread 1 must be visible to thread 3, before the write to b from thread 2.
考虑上面的示例。CUDA 内存一致性模型保证了所断言的条件将为真,因此线程 1 的对 x 的写入必须在线程 2 的对 b 的写入之前对线程 3 可见。

The memory ordering provided by the release and acquire of a is only sufficient to make x visible to thread 2, not thread 3, as it is a device-scope operation. The system-scope ordering provided by release and acquire of b, therefore, needs to ensure not only writes issued from thread 2 itself are visible to thread 3, but also writes from other threads that are visible to thread 2. This is known as cumulativity. As the GPU cannot know at the time of execution which writes have been guaranteed at the source level to be visible and which are visible only by chance timing, it must cast a conservatively wide net for in-flight memory operations.
通过 a 的释放和获取提供的内存排序仅足以使 x 对线程 2 可见,而不对线程 3 可见,因为它是设备范围的操作。因此,通过 b 的释放和获取提供的系统范围排序需要确保不仅线程 2 本身发出的写入对线程 3 可见,还需要确保其他线程发出的对线程 2 可见的写入也对线程 3 可见。这被称为累积性。由于 GPU 无法在执行时知道哪些写入已经在源级别上保证可见,哪些仅仅是由于偶然的时序而可见,因此必须对正在进行的内存操作进行保守地广泛覆盖。

This sometimes leads to interference: because the GPU is waiting on memory operations it is not required to at the source level, the fence/flush may take longer than necessary.
有时会导致干扰:因为 GPU 正在等待在源级别上不需要的内存操作,所以栅栏/刷新可能比必要的时间更长。

Note that fences may occur explicitly as intrinsics or atomics in code, like in the example, or implicitly to implement synchronizes-with relationships at task boundaries.
请注意,栅栏可能会在代码中显式地作为内在函数或原子操作出现,就像示例中一样,也可能隐式地在任务边界实现同步关系。

A common example is when a kernel is performing computation in local GPU memory, and a parallel kernel (e.g. from NCCL) is performing communications with a peer. Upon completion, the local kernel will implicitly flush its writes to satisfy any synchronizes-with relationships to downstream work. This may unnecessarily wait, fully or partially, on slower NVLink or PCIe writes from the communication kernel.
一个常见的例子是:一个内核在本地 GPU 内存中执行计算,同时一个并行内核(例如来自 NCCL)正在与对等设备进行通信。完成时,本地内核将隐式刷新其写入,以满足与下游工作的任何同步关系。这可能导致它不必要地(全部或部分)等待通信内核较慢的 NVLink 或 PCIe 写入。

3.2.7.2. Isolating Traffic with Domains
3.2.7.2. 使用域隔离流量 

Beginning with Hopper architecture GPUs and CUDA 12.0, the memory synchronization domains feature provides a way to alleviate such interference. In exchange for explicit assistance from code, the GPU can reduce the net cast by a fence operation. Each kernel launch is given a domain ID. Writes and fences are tagged with the ID, and a fence will only order writes matching the fence’s domain. In the concurrent compute vs communication example, the communication kernels can be placed in a different domain.
从 Hopper 架构 GPU 和 CUDA 12.0 开始,内存同步域功能提供了一种缓解这种干扰的方法。作为代码提供显式协助的交换,GPU 可以缩小栅栏操作所覆盖的范围。每个内核启动都会被赋予一个域 ID。写入和栅栏都会带有该 ID,并且栅栏只会对与其域匹配的写入进行排序。在并发计算与通信的示例中,通信内核可以放置在不同的域中。

When using domains, code must abide by the rule that ordering or synchronization between distinct domains on the same GPU requires system-scope fencing. Within a domain, device-scope fencing remains sufficient. This is necessary for cumulativity as one kernel’s writes will not be encompassed by a fence issued from a kernel in another domain. In essence, cumulativity is satisfied by ensuring that cross-domain traffic is flushed to the system scope ahead of time.
在使用域时,代码必须遵守一个规则,即在同一 GPU 上的不同域之间的排序或同步需要系统范围的围栏。在一个域内,设备范围的围栏仍然足够。这对于累积性是必要的,因为一个内核的写入不会被来自另一个域内核的围栏所包含。实质上,通过确保跨域流量提前刷新到系统范围,可以满足累积性。
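The rule can be sketched with libcu++ atomics as below (not from the original guide; assumes the two kernels are launched with different domain IDs and run concurrently, e.g. on different streams):
该规则可以用 libcu++ 原子操作示意如下(并非出自原指南;假设两个内核以不同的域 ID 启动并并发运行,例如在不同的流上):

```cpp
#include <cuda/atomic>

// Sketch: publishing data from a kernel in one domain to a kernel in
// another domain on the SAME GPU. Device scope is not sufficient across
// domains; system scope is required.
__device__ int payload;
__device__ cuda::atomic<int, cuda::thread_scope_system> flag{0};

__global__ void producer() {   // assumed launched in domain A
    payload = 42;
    // System-scope release: flushes cross-domain traffic ahead of time.
    flag.store(1, cuda::std::memory_order_release);
}

__global__ void consumer() {   // assumed launched in domain B
    while (flag.load(cuda::std::memory_order_acquire) != 1) { }
    // payload is now visible even though the kernels are in different domains.
}
```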

Note that this modifies the definition of thread_scope_device. However, because kernels will default to domain 0 as described below, backward compatibility is maintained.
请注意,这会修改 thread_scope_device 的定义。但是,由于内核将默认为域 0,如下所述,向后兼容性得以保持。

3.2.7.3. Using Domains in CUDA
3.2.7.3. 在 CUDA 中使用域 

Domains are accessible via the new launch attributes cudaLaunchAttributeMemSyncDomain and cudaLaunchAttributeMemSyncDomainMap. The former selects between logical domains cudaLaunchMemSyncDomainDefault and cudaLaunchMemSyncDomainRemote, and the latter provides a mapping from logical to physical domains. The remote domain is intended for kernels performing remote memory access in order to isolate their memory traffic from local kernels. Note, however, the selection of a particular domain does not affect what memory access a kernel may legally perform.
域可以通过新的启动属性 cudaLaunchAttributeMemSyncDomaincudaLaunchAttributeMemSyncDomainMap 访问。前者在逻辑域 cudaLaunchMemSyncDomainDefaultcudaLaunchMemSyncDomainRemote 之间进行选择,后者提供逻辑到物理域的映射。远程域旨在供执行远程内存访问的内核使用,以便将其内存流量与本地内核隔离开来。但请注意,选择特定域并不影响内核可以合法执行的内存访问。

The domain count can be queried via device attribute cudaDevAttrMemSyncDomainCount. Hopper has 4 domains. To facilitate portable code, domains functionality can be used on all devices and CUDA will report a count of 1 prior to Hopper.
域计数可以通过设备属性 cudaDevAttrMemSyncDomainCount 查询。Hopper 有 4 个域。为了方便可移植的代码,域功能可以在所有设备上使用,并且 CUDA 将在 Hopper 之前报告计数为 1。

Having logical domains eases application composition. An individual kernel launch at a low level in the stack, such as from NCCL, can select a semantic logical domain without concern for the surrounding application architecture. Higher levels can steer logical domains using the mapping. The default value for the logical domain if it is not set is the default domain, and the default mapping is to map the default domain to 0 and the remote domain to 1 (on GPUs with more than 1 domain). Specific libraries may tag launches with the remote domain in CUDA 12.0 and later; for example, NCCL 2.16 will do so. Together, this provides a beneficial use pattern for common applications out of the box, with no code changes needed in other components, frameworks, or at application level. An alternative use pattern, for example in an application using nvshmem or with no clear separation of kernel types, could be to partition parallel streams. Stream A may map both logical domains to physical domain 0, stream B to 1, and so on.
具有逻辑域可简化应用程序的组合。堆栈低层的单个内核启动(例如来自 NCCL)可以选择语义上的逻辑域,而无需考虑周围的应用程序架构。较高层可以使用映射来引导逻辑域。如果未设置逻辑域,其默认值为默认域;默认映射是将默认域映射到 0,将远程域映射到 1(在具有多个域的 GPU 上)。特定库可能会在 CUDA 12.0 及更高版本中用远程域标记启动;例如,NCCL 2.16 将这样做。总体而言,这为常见应用程序提供了开箱即用的有益使用模式,无需在其他组件、框架或应用程序级别进行代码更改。另一种使用模式(例如在使用 nvshmem 的应用程序中,或者在没有明确分离内核类型的情况下)可以是对并行流进行分区:流 A 可以将两个逻辑域都映射到物理域 0,流 B 映射到 1,依此类推。

// Example of launching a kernel with the remote logical domain
cudaLaunchAttribute domainAttr;
domainAttr.id = cudaLaunchAttributeMemSyncDomain;
domainAttr.val.memSyncDomain = cudaLaunchMemSyncDomainRemote;
cudaLaunchConfig_t config;
// Fill out other config fields
config.attrs = &domainAttr;
config.numAttrs = 1;
cudaLaunchKernelEx(&config, myKernel, kernelArg1, kernelArg2...);
// Example of setting a mapping for a stream
// (This mapping is the default for streams starting on Hopper if not
// explicitly set, and provided for illustration)
cudaLaunchAttributeValue mapAttr;
mapAttr.memSyncDomainMap.default_ = 0;
mapAttr.memSyncDomainMap.remote = 1;
cudaStreamSetAttribute(stream, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);
// Example of mapping different streams to different physical domains, ignoring
// logical domain settings (reusing mapAttr from above)
mapAttr.memSyncDomainMap.default_ = 0;
mapAttr.memSyncDomainMap.remote = 0;
cudaStreamSetAttribute(streamA, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);
mapAttr.memSyncDomainMap.default_ = 1;
mapAttr.memSyncDomainMap.remote = 1;
cudaStreamSetAttribute(streamB, cudaLaunchAttributeMemSyncDomainMap, &mapAttr);

As with other launch attributes, these are exposed uniformly on CUDA streams, individual launches using cudaLaunchKernelEx, and kernel nodes in CUDA graphs. A typical use would set the mapping at stream level and the logical domain at launch level (or bracketing a section of stream use) as described above.
与其他启动属性一样,这些属性在 CUDA 流、使用 cudaLaunchKernelEx 的单次启动以及 CUDA 图中的内核节点上以统一方式公开。典型用法是如上所述在流级别设置映射,在启动级别(或围绕流使用的某一区段)设置逻辑域。

Both attributes are copied to graph nodes during stream capture. Graphs take both attributes from the node itself, essentially an indirect way of specifying a physical domain. Domain-related attributes set on the stream a graph is launched into are not used in execution of the graph.
在流捕获期间,这两个属性都会被复制到图节点。图从节点本身获取这两个属性,这本质上是指定物理域的一种间接方式。在图所启动到的流上设置的与域相关的属性,在图的执行中不会被使用。

3.2.8. Asynchronous Concurrent Execution
3.2.8. 异步并发执行 

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:
CUDA 将以下操作作为独立任务公开,这些任务可以彼此并发运行:

  • Computation on the host; 主机上的计算;

  • Computation on the device;
    设备上的计算;

  • Memory transfers from the host to the device;
    从主机到设备的内存传输;

  • Memory transfers from the device to the host;
    从设备到主机的内存传输;

  • Memory transfers within the memory of a given device;
    给定设备内存中的内存传输;

  • Memory transfers among devices.
    设备之间的内存传输。

The level of concurrency achieved between these operations will depend on the feature set and compute capability of the device as described below.
这些操作之间实现的并发级别将取决于设备的功能集和计算能力,如下所述。

3.2.8.1. Concurrent Execution between Host and Device
3.2.8.1. 主机和设备之间的并发执行 

Concurrent host execution is facilitated through asynchronous library functions that return control to the host thread before the device completes the requested task. Using asynchronous calls, many device operations can be queued up together to be executed by the CUDA driver when appropriate device resources are available. This relieves the host thread of much of the responsibility to manage the device, leaving it free for other tasks. The following device operations are asynchronous with respect to the host:
通过异步库函数实现并发主机执行,这些函数在设备完成请求的任务之前将控制权返回给主机线程。使用异步调用,许多设备操作可以一起排队,以便在适当的设备资源可用时由 CUDA 驱动程序执行。这减轻了主机线程管理设备的许多责任,使其可以用于其他任务。以下设备操作与主机异步:

  • Kernel launches; 内核启动;

  • Memory copies within a single device’s memory;
    单个设备内存中的内存复制;

  • Memory copies from host to device of a memory block of 64 KB or less;
    主机到设备的内存块大小为 64 KB 或更小的内存复制;

  • Memory copies performed by functions that are suffixed with Async;
    由以 Async 结尾的函数执行的内存复制操作;

  • Memory set function calls.
    内存设置函数调用。

Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
程序员可以通过将 CUDA_LAUNCH_BLOCKING 环境变量设置为 1 来全局禁用系统上运行的所有 CUDA 应用程序的内核启动的异步性。此功能仅用于调试目的,不应作为使生产软件可靠运行的方法。

Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies might also be synchronous if they involve host memory that is not page-locked.
如果通过分析器(Nsight、Visual Profiler)收集硬件计数器,则内核启动是同步的,除非启用了并发内核分析。如果涉及的主机内存不是页锁定的, Async 内存复制也可能是同步的。

3.2.8.2. Concurrent Kernel Execution
3.2.8.2. 并发内核执行 

Some devices of compute capability 2.x and higher can execute multiple kernels concurrently. Applications may query this capability by checking the concurrentKernels device property (see Device Enumeration), which is equal to 1 for devices that support it.
某些计算能力为 2.x 及更高版本的设备可以同时执行多个内核。应用程序可以通过检查 concurrentKernels 设备属性(请参阅设备枚举)来查询此功能,对于支持此功能的设备,该属性值为 1。

The maximum number of kernel launches that a device can execute concurrently depends on its compute capability and is listed in Table 21.
设备可以同时执行的内核启动次数的最大值取决于其计算能力,并列在表 21 中。

A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context. The GPU may time slice to provide forward progress to each context. If a user wants to run kernels from multiple processes simultaneously on the SM, one must enable MPS.
一个 CUDA 上下文的内核不能与另一个 CUDA 上下文的内核并发执行。GPU 可能会进行时间切片,以为每个上下文提供前进进度。如果用户希望在 SM 上同时运行多个进程的内核,必须启用 MPS。

Kernels that use many textures or a large amount of local memory are less likely to execute concurrently with other kernels.
使用许多纹理或大量本地内存的内核不太可能与其他内核同时执行。

3.2.8.3. Overlap of Data Transfer and Kernel Execution
3.2.8.3. 数据传输和内核执行的重叠 

Some devices can perform an asynchronous memory copy to or from the GPU concurrently with kernel execution. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is greater than zero for devices that support it. If host memory is involved in the copy, it must be page-locked.
一些设备可以在内核执行的同时与 GPU 并行执行异步内存复制。应用程序可以通过检查 asyncEngineCount 设备属性(请参阅设备枚举)来查询此功能,对于支持此功能的设备,该属性大于零。如果主机内存涉及到复制,它必须是页面锁定的。

It is also possible to perform an intra-device copy simultaneously with kernel execution (on devices that support the concurrentKernels device property) and/or with copies to or from the device (for devices that support the asyncEngineCount property). Intra-device copies are initiated using the standard memory copy functions with destination and source addresses residing on the same device.
还可以同时执行设备内拷贝和内核执行(对支持 concurrentKernels 设备属性的设备)和/或与设备之间的拷贝(对支持 asyncEngineCount 属性的设备)。设备内拷贝是使用标准内存拷贝函数发起的,目标和源地址都位于同一设备上。

3.2.8.4. Concurrent Data Transfers
3.2.8.4. 并发数据传输 

Some devices of compute capability 2.x and higher can overlap copies to and from the device. Applications may query this capability by checking the asyncEngineCount device property (see Device Enumeration), which is equal to 2 for devices that support it. In order to be overlapped, any host memory involved in the transfers must be page-locked.
某些计算能力为 2.x 及更高版本的设备可以重叠到设备和从设备复制。应用程序可以通过检查 asyncEngineCount 设备属性(请参阅设备枚举)来查询此功能,对于支持此功能的设备,该属性等于 2。为了实现重叠,涉及传输的任何主机内存都必须是页面锁定的。

3.2.8.5. Streams 3.2.8.5. 流 

Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (for example, inter-kernel communication is undefined). The commands issued on a stream may execute when all the dependencies of the command are met. The dependencies could be previously launched commands on same stream or dependencies from other streams. The successful completion of synchronize call guarantees that all the commands launched are completed.
应用程序通过流来管理上述描述的并发操作。流是一系列命令(可能由不同的主机线程发出),按顺序执行。另一方面,不同的流可能会以不同的顺序或并发地执行它们的命令;这种行为不能保证,因此不应依赖于正确性(例如,内核间通信是未定义的)。在流上发出的命令可能在命令的所有依赖关系满足时执行。这些依赖关系可以是同一流上先前启动的命令,也可以是其他流的依赖关系。同步调用的成功完成保证了所有启动的命令都已完成。

3.2.8.5.1. Creation and Destruction of Streams
3.2.8.5.1. 流的创建和销毁 

A stream is defined by creating a stream object and specifying it as the stream parameter to a sequence of kernel launches and host <-> device memory copies. The following code sample creates two streams and allocates an array hostPtr of float in page-locked memory.
通过创建流对象并将其指定为一系列内核启动和主机 <-> 设备内存复制的流参数来定义流。以下代码示例创建两个流,并在页锁定内存中分配一个 float 数组 hostPtr

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:
每个流由以下代码示例定义为从主机到设备的一次内存复制、一个内核启动和一个从设备到主机的内存复制序列:

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel <<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}

Each stream copies its portion of input array hostPtr to array inputDevPtr in device memory, processes inputDevPtr on the device by calling MyKernel(), and copies the result outputDevPtr back to the same portion of hostPtr. Overlapping Behavior describes how the streams overlap in this example depending on the capability of the device. Note that hostPtr must point to page-locked host memory for any overlap to occur.
每个流将其输入数组的部分 hostPtr 复制到设备内存中的数组 inputDevPtr ,通过调用 MyKernel() 在设备上处理 inputDevPtr ,然后将结果 outputDevPtr 复制回 hostPtr 的相同部分。重叠行为描述了这个示例中流如何重叠,取决于设备的能力。请注意, hostPtr 必须指向页锁定的主机内存,以便发生任何重叠。

Streams are released by calling cudaStreamDestroy().
流通过调用 cudaStreamDestroy() 释放。

for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

In case the device is still doing work in the stream when cudaStreamDestroy() is called, the function will return immediately and the resources associated with the stream will be released automatically once the device has completed all work in the stream.
如果在调用 cudaStreamDestroy() 时设备仍在流中工作,函数将立即返回,并一旦设备完成流中的所有工作,与流相关的资源将自动释放。

3.2.8.5.2. Default Stream
3.2.8.5.2. 默认流 

Kernel launches and host <-> device memory copies that do not specify any stream parameter, or equivalently that set the stream parameter to zero, are issued to the default stream. They are therefore executed in order.
内核启动和主机 <-> 设备内存拷贝,如果没有指定任何流参数,或者将流参数设置为零,则会发送到默认流。因此,它们是按顺序执行的。

For code that is compiled using the --default-stream per-thread compilation flag (or that defines the CUDA_API_PER_THREAD_DEFAULT_STREAM macro before including CUDA headers (cuda.h and cuda_runtime.h)), the default stream is a regular stream and each host thread has its own default stream.
对于使用 --default-stream per-thread 编译标志编译的代码(或在包含 CUDA 头文件之前定义 CUDA_API_PER_THREAD_DEFAULT_STREAM 宏( cuda.hcuda_runtime.h )),默认流是常规流,每个主机线程都有自己的默认流。

Note 注意

#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1 cannot be used to enable this behavior when the code is compiled by nvcc as nvcc implicitly includes cuda_runtime.h at the top of the translation unit. In this case the --default-stream per-thread compilation flag needs to be used or the CUDA_API_PER_THREAD_DEFAULT_STREAM macro needs to be defined with the -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 compiler flag.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM 1 无法在代码由 nvcc 编译时用于启用此行为,因为 nvcc 隐式地在翻译单元的顶部包含了 cuda_runtime.h 。在这种情况下,需要使用 --default-stream per-thread 编译标志,或者需要使用 -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 编译标志定义 CUDA_API_PER_THREAD_DEFAULT_STREAM 宏。

For code that is compiled using the --default-stream legacy compilation flag, the default stream is a special stream called the NULL stream and each device has a single NULL stream used for all host threads. The NULL stream is special as it causes implicit synchronization as described in Implicit Synchronization.
对于使用 --default-stream legacy 编译标志编译的代码,默认流是一个特殊流,称为 NULL 流,每个设备都有一个用于所有主机线程的单个 NULL 流。 NULL 流很特殊,因为它会导致隐式同步,如隐式同步中所述。

For code that is compiled without specifying a --default-stream compilation flag, --default-stream legacy is assumed as the default.
对于没有指定 --default-stream 编译标志的编码,默认情况下假定 --default-stream legacy
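For instance, either of the following command lines enables the per-thread default stream under nvcc (the file names are illustrative):
例如,以下任一命令行都可以在 nvcc 下启用每线程默认流(文件名仅为示意):

```shell
nvcc --default-stream per-thread -o app app.cu
# or, defining the macro on the command line (required under nvcc):
nvcc -DCUDA_API_PER_THREAD_DEFAULT_STREAM=1 -o app app.cu
```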

3.2.8.5.3. Explicit Synchronization
3.2.8.5.3. 显式同步 

There are various ways to explicitly synchronize streams with each other.
有多种显式同步流之间的方法。

cudaDeviceSynchronize() waits until all preceding commands in all streams of all host threads have completed.
cudaDeviceSynchronize() 等待直到所有主机线程的所有流中的所有前置命令完成。

cudaStreamSynchronize()takes a stream as a parameter and waits until all preceding commands in the given stream have completed. It can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.
cudaStreamSynchronize() 以流作为参数,并等待给定流中所有先前的命令完成。它可用于将主机与特定流同步,从而允许其他流在设备上继续执行。

cudaStreamWaitEvent()takes a stream and an event as parameters (see Events for a description of events)and makes all the commands added to the given stream after the call to cudaStreamWaitEvent()delay their execution until the given event has completed.
cudaStreamWaitEvent() 接受流和事件作为参数(请参阅事件以了解事件的描述),并使得在调用 cudaStreamWaitEvent() 后添加到给定流中的所有命令延迟执行,直到给定事件完成。

cudaStreamQuery()provides applications with a way to know if all preceding commands in a stream have completed.
cudaStreamQuery() 为应用程序提供了一种方法,用于确定流中所有前置命令是否已完成。
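These calls can be combined as sketched below (not from the original guide; the streams and the work issued into them are assumed to exist) to make one stream wait on another without blocking the host:
这些调用可以如下示意组合使用(并非出自原指南;假设这些流及其中发出的工作已经存在),使一个流等待另一个流而不阻塞主机:

```cpp
#include <cuda_runtime.h>

// Sketch: make later commands in stream b wait on work already issued to
// stream a, using an event instead of a host-side synchronization.
void crossStreamDependency(cudaStream_t a, cudaStream_t b) {
    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // ... work issued into stream a here ...
    cudaEventRecord(done, a);         // marks completion of a's work so far
    cudaStreamWaitEvent(b, done, 0);  // b's subsequent commands wait on it
    // ... dependent work issued into stream b here ...

    // The host stays free; it can poll instead of blocking:
    while (cudaStreamQuery(b) == cudaErrorNotReady) {
        // do other host work
    }
    cudaEventDestroy(done);
}
```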

3.2.8.5.4. Implicit Synchronization
3.2.8.5.4. 隐式同步 

Two commands from different streams cannot run concurrently if any one of the following operations is issued in-between them by the host thread:
如果主机线程在来自不同流的两个命令之间发出以下任一操作,则这两个命令不能并发运行:

  • a page-locked host memory allocation,
    页面锁定的主机内存分配

  • a device memory allocation,
    设备内存分配

  • a device memory set, 设备内存设置(memset)

  • a memory copy between two addresses to the same device memory,
    在同一设备内存地址之间进行的内存复制

  • any CUDA command to the NULL stream,
    发出到 NULL 流的任何 CUDA 命令,

  • a switch between the L1/shared memory configurations described in Compute Capability 7.x.
    在计算能力 7.x 中描述的 L1/共享内存配置之间切换。

Operations that require a dependency check include any other commands within the same stream as the launch being checked and any call to cudaStreamQuery() on that stream. Therefore, applications should follow these guidelines to improve their potential for concurrent kernel execution:
需要依赖检查的操作包括与正在检查的启动在同一流中的任何其他命令以及对该流上的 cudaStreamQuery() 的任何调用。因此,应用程序应遵循以下准则以提高其并发内核执行的潜力:

  • All independent operations should be issued before dependent operations,
    所有独立操作应在依赖操作之前发出

  • Synchronization of any kind should be delayed as long as possible.
    尽可能延迟任何类型的同步。

3.2.8.5.5. Overlapping Behavior
3.2.8.5.5. 重叠行为 

The amount of execution overlap between two streams depends on the order in which the commands are issued to each stream and whether or not the device supports overlap of data transfer and kernel execution (see Overlap of Data Transfer and Kernel Execution), concurrent kernel execution (see Concurrent Kernel Execution), and/or concurrent data transfers (see Concurrent Data Transfers).
两个流之间的执行重叠量取决于向每个流发出命令的顺序以及设备是否支持数据传输和内核执行的重叠(请参阅数据传输和内核执行的重叠)、并发内核执行(请参阅并发内核执行)和/或并发数据传输(请参阅并发数据传输)。

For example, on devices that do not support concurrent data transfers, the two streams of the code sample of Creation and Destruction do not overlap at all because the memory copy from host to device is issued to stream[1] after the memory copy from device to host is issued to stream[0], so it can only start once the memory copy from device to host issued to stream[0] has completed. If the code is rewritten the following way (and assuming the device supports overlap of data transfer and kernel execution)
例如,在不支持并发数据传输的设备上,Creation 和 Destruction 代码示例中的两个流根本不重叠,因为从主机到设备的内存复制是在将从设备到主机的内存复制发出到 stream[0]之后发出到 stream[1]的,因此只能在将从设备到主机的内存复制发出到 stream[0]完成后才能开始。如果按照以下方式重写代码(并假设设备支持数据传输和内核执行的重叠)

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<100, 512, 0, stream[i]>>>
          (outputDevPtr + i * size, inputDevPtr + i * size, size);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);

then the memory copy from host to device issued to stream[1] overlaps with the kernel launch issued to stream[0].
然后发出到 stream[1] 的从主机到设备的内存复制将与发出到 stream[0] 的内核启动重叠。

On devices that do support concurrent data transfers, the two streams of the code sample of Creation and Destruction do overlap: The memory copy from host to device issued to stream[1] overlaps with the memory copy from device to host issued to stream[0] and even with the kernel launch issued to stream[0] (assuming the device supports overlap of data transfer and kernel execution).
在支持并发数据传输的设备上,Creation 和 Destruction 代码示例中的两个流会重叠:从主机到设备的内存复制发出到 stream[1]与从设备到主机的内存复制发出到 stream[0]甚至与发出到 stream[0]的内核启动重叠(假设设备支持数据传输和内核执行的重叠)。

3.2.8.5.6. Host Functions (Callbacks)
3.2.8.5.6. 主机函数(回调) 

The runtime provides a way to insert a CPU function call at any point into a stream via cudaLaunchHostFunc(). The provided function is executed on the host once all commands issued to the stream before the callback have completed.
运行时提供了一种通过 cudaLaunchHostFunc() 在流的任意点插入 CPU 函数调用的方式。一旦在回调之前发出到流的所有命令都已完成,提供的函数将在主机上执行。

The following code sample adds the host function MyCallback to each of two streams after issuing a host-to-device memory copy, a kernel launch and a device-to-host memory copy into each stream. The function will begin execution on the host after each of the device-to-host memory copies completes.
以下代码示例在向两个流中的每一个发出主机到设备内存复制、内核启动和设备到主机内存复制之后,将主机函数 MyCallback 添加到每个流中。每个设备到主机内存复制完成后,该函数将在主机上开始执行。

void CUDART_CB MyCallback(void *data) {
    printf("Inside callback %zu\n", (size_t)data);
}
...
for (size_t i = 0; i < 2; ++i) {
    cudaMemcpyAsync(devPtrIn[i], hostPtr[i], size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(devPtrOut[i], devPtrIn[i], size);
    cudaMemcpyAsync(hostPtr[i], devPtrOut[i], size, cudaMemcpyDeviceToHost, stream[i]);
    cudaLaunchHostFunc(stream[i], MyCallback, (void*)i);
}

The commands that are issued in a stream after a host function do not start executing before the function has completed.
在主机函数之后发出到流中的命令,在该函数完成之前不会开始执行。

A host function enqueued into a stream must not make CUDA API calls (directly or indirectly), as it might end up waiting on itself if it makes such a call leading to a deadlock.
将一个主机函数排入流中时,不得进行 CUDA API 调用(直接或间接),因为如果进行这样的调用,可能会导致等待自身而陷入死锁。

3.2.8.5.7. Stream Priorities
3.2.8.5.7. 流优先级 

The relative priorities of streams can be specified at creation using cudaStreamCreateWithPriority(). The range of allowable priorities, ordered as [ highest priority, lowest priority ] can be obtained using the cudaDeviceGetStreamPriorityRange() function. At runtime, pending work in higher-priority streams takes preference over pending work in low-priority streams.
流的相对优先级可以在创建时使用 cudaStreamCreateWithPriority() 来指定。可接受优先级范围,按[最高优先级,最低优先级]排序,可以使用 cudaDeviceGetStreamPriorityRange() 函数获得。在运行时,高优先级流中的待处理工作优先于低优先级流中的待处理工作。

The following code sample obtains the allowable range of priorities for the current device, and creates streams with the highest and lowest available priorities.
以下代码示例获取当前设备的可允许优先级范围,并创建具有最高和最低可用优先级的流。

// get the range of stream priorities for this device
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);

3.2.8.6. Programmatic Dependent Launch and Synchronization

The Programmatic Dependent Launch mechanism allows for a dependent secondary kernel to launch before the primary kernel it depends on in the same CUDA stream has finished executing. Available starting with devices of compute capability 9.0, this technique can provide performance benefits when the secondary kernel can complete significant work that does not depend on the results of the primary kernel.

3.2.8.6.1. Background

A CUDA application utilizes the GPU by launching and executing multiple kernels on it. A typical GPU activity timeline is shown in Figure 10.


Figure 10 GPU activity timeline

Here, secondary_kernel is launched after primary_kernel finishes its execution. Serialized execution is usually necessary because secondary_kernel depends on result data produced by primary_kernel. If secondary_kernel has no dependency on primary_kernel, both of them can be launched concurrently by using CUDA streams. Even if secondary_kernel is dependent on primary_kernel, there is some potential for concurrent execution. For example, almost all the kernels have some sort of preamble section during which tasks such as zeroing buffers or loading constant values are performed.


Figure 11 Preamble section of secondary_kernel

Figure 11 demonstrates the portion of secondary_kernel that could be executed concurrently without impacting the application. Note that concurrent launch also allows us to hide the launch latency of secondary_kernel behind the execution of primary_kernel.


Figure 12 Concurrent execution of primary_kernel and secondary_kernel

The concurrent launch and execution of secondary_kernel shown in Figure 12 is achievable using Programmatic Dependent Launch.

Programmatic Dependent Launch introduces changes to the CUDA kernel launch APIs as explained in the following section. These APIs require at least compute capability 9.0 to provide overlapping execution.

3.2.8.6.2. API Description

In Programmatic Dependent Launch, a primary and a secondary kernel are launched in the same CUDA stream. The primary kernel should execute cudaTriggerProgrammaticLaunchCompletion with all thread blocks when it’s ready for the secondary kernel to launch. The secondary kernel must be launched using the extensible launch API as shown.

__global__ void primary_kernel() {
   // Initial work that should finish before starting secondary kernel

   // Trigger the secondary kernel
   cudaTriggerProgrammaticLaunchCompletion();

   // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
   // Independent work

   // Will block until all primary kernels the secondary kernel is dependent on have completed and flushed results to global memory
   cudaGridDependencySynchronize();

   // Dependent work
}

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;
configSecondary.attrs = attribute;
configSecondary.numAttrs = 1;

primary_kernel<<<grid_dim, block_dim, 0, stream>>>();
cudaLaunchKernelEx(&configSecondary, secondary_kernel);

When the secondary kernel is launched using the cudaLaunchAttributeProgrammaticStreamSerialization attribute, the CUDA driver may safely launch the secondary kernel early, rather than waiting for the primary kernel to complete and flush its memory before launching the secondary.

The CUDA driver can launch the secondary kernel when all primary thread blocks have launched and executed cudaTriggerProgrammaticLaunchCompletion. If the primary kernel doesn’t execute the trigger, it implicitly occurs after all thread blocks in the primary kernel exit.

In either case, the secondary thread blocks might launch before data written by the primary kernel is visible. As such, when the secondary kernel is configured with Programmatic Dependent Launch, it must always use cudaGridDependencySynchronize or other means to verify that the result data from the primary is available.

Please note that these methods provide the opportunity for the primary and secondary kernels to execute concurrently, however this behavior is opportunistic and not guaranteed to lead to concurrent kernel execution. Reliance on concurrent execution in this manner is unsafe and can lead to deadlock.

3.2.8.6.3. Use in CUDA Graphs

Programmatic Dependent Launch can be used in CUDA Graphs via stream capture or directly via edge data. To program this feature in a CUDA Graph with edge data, use a cudaGraphDependencyType value of cudaGraphDependencyTypeProgrammatic on an edge connecting two kernel nodes. This edge type makes the upstream kernel visible to a cudaGridDependencySynchronize() in the downstream kernel. This type must be used with an outgoing port of either cudaGraphKernelNodePortLaunchCompletion or cudaGraphKernelNodePortProgrammatic.

The resulting graph equivalents for stream capture are as follows:

// Stream code (abbreviated):
cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute.val.programmaticStreamSerializationAllowed = 1;
// Resulting graph edge:
cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

// Stream code (abbreviated):
cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticEvent;
attribute.val.programmaticEvent.triggerAtBlockStart = 0;
// Resulting graph edge:
cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

// Stream code (abbreviated):
cudaLaunchAttribute attribute;
attribute.id = cudaLaunchAttributeProgrammaticEvent;
attribute.val.programmaticEvent.triggerAtBlockStart = 1;
// Resulting graph edge:
cudaGraphEdgeData edgeData;
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortLaunchCompletion;

3.2.8.7. CUDA Graphs

CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.

To see the optimizations possible with graphs, consider what happens in a stream: when you place a kernel into a stream, the host driver performs a sequence of operations in preparation for the execution of the kernel on the GPU. These operations, necessary for setting up and launching the kernel, are an overhead cost which must be paid for each kernel that is issued. For a GPU kernel with a short execution time, this overhead cost can be a significant fraction of the overall end-to-end execution time.

Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution.

  • During the definition phase, a program creates a description of the operations in the graph along with the dependencies between them.

  • Instantiation takes a snapshot of the graph template, validates it, and performs much of the setup and initialization of work with the aim of minimizing what needs to be done at launch. The resulting instance is known as an executable graph.

  • An executable graph may be launched into a stream, similar to any other CUDA work. It may be launched any number of times without repeating the instantiation.
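A minimal sketch of the instantiation and execution stages, assuming graph has already been defined (via the explicit graph APIs or stream capture) and stream is a valid CUDA stream:

```cpp
// Instantiation: snapshot and validate the graph template, and
// pre-initialize as much of the launch work as possible.
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// Execution: the executable graph can be launched repeatedly
// without repeating the instantiation.
for (int i = 0; i < 10; ++i) {
    cudaGraphLaunch(graphExec, stream);
}
cudaStreamSynchronize(stream);
```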

3.2.8.7.1. Graph Structure

An operation forms a node in a graph. The dependencies between the operations are the edges. These dependencies constrain the execution sequence of the operations.

An operation may be scheduled at any time once the nodes on which it depends are complete. Scheduling is left up to the CUDA system.

3.2.8.7.1.1. Node Types

A graph node can be one of:

  • kernel

  • CPU function call

  • memory copy

  • memset

  • empty node

  • waiting on an event

  • recording an event

  • signalling an external semaphore

  • waiting on an external semaphore

  • conditional node

  • child graph: To execute a separate nested graph, as shown in the following figure.


Figure 13 Child Graph Example

3.2.8.7.1.2. Edge Data

CUDA 12.3 introduced edge data on CUDA Graphs. Edge data modifies a dependency specified by an edge and consists of three parts: an outgoing port, an incoming port, and a type. An outgoing port specifies when an associated edge is triggered. An incoming port specifies what portion of a node is dependent on an associated edge. A type modifies the relation between the endpoints.

Port values are specific to node type and direction, and edge types may be restricted to specific node types. In all cases, zero-initialized edge data represents default behavior. Outgoing port 0 waits on an entire task, incoming port 0 blocks an entire task, and edge type 0 is associated with a full dependency with memory synchronizing behavior.

Edge data is optionally specified in various graph APIs via a parallel array to the associated nodes. If it is omitted as an input parameter, zero-initialized data is used. If it is omitted as an output (query) parameter, the API accepts this if the edge data being ignored is all zero-initialized, and returns cudaErrorLossyQuery if the call would discard information.

Edge data is also available in some stream capture APIs: cudaStreamBeginCaptureToGraph(), cudaStreamGetCaptureInfo(), and cudaStreamUpdateCaptureDependencies(). In these cases, there is not yet a downstream node. The data is associated with a dangling edge (half edge) which will either be connected to a future captured node or discarded at termination of stream capture. Note that some edge types do not wait on full completion of the upstream node. These edges are ignored when considering if a stream capture has been fully rejoined to the origin stream, and cannot be discarded at the end of capture. See Creating a Graph Using Stream Capture.

Currently, no node types define additional incoming ports, and only kernel nodes define additional outgoing ports. There is one non-default dependency type, cudaGraphDependencyTypeProgrammatic, which enables Programmatic Dependent Launch between two kernel nodes.
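As a hedged sketch of attaching this dependency type when connecting two existing kernel nodes, assuming the _v2 dependency APIs introduced alongside edge data; upstream and downstream are hypothetical node handles:

```cpp
// Zero-initialized edge data represents the default full dependency;
// override the type and outgoing port to request Programmatic
// Dependent Launch between the two kernel nodes.
cudaGraphEdgeData edgeData = {};
edgeData.type = cudaGraphDependencyTypeProgrammatic;
edgeData.from_port = cudaGraphKernelNodePortProgrammatic;

// Hypothetical handles: `upstream` and `downstream` are existing
// kernel nodes in `graph`.
cudaGraphAddDependencies_v2(graph, &upstream, &downstream, &edgeData, 1);
```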

3.2.8.7.2. Creating a Graph Using Graph APIs

Graphs can be created via two mechanisms: explicit API and stream capture. The following is an example of creating and executing the below graph.


Figure 14 Creating a Graph Using Graph APIs Example

// Create the graph - it starts out empty
cudaGraphCreate(&graph, 0);

// For the purpose of this example, we'll create
// the nodes separately from the dependencies to
// demonstrate that it can be done in two stages.
// Note that dependencies can also be specified
// at node creation.
cudaGraphAddKernelNode(&a, graph, NULL, 0, &nodeParams);
cudaGraphAddKernelNode(&b, graph, NULL, 0, &nodeParams);
cudaGraphAddKernelNode(&c, graph, NULL, 0, &nodeParams);
cudaGraphAddKernelNode(&d, graph, NULL, 0, &nodeParams);

// Now set up dependencies on each node
cudaGraphAddDependencies(graph, &a, &b, 1);     // A->B
cudaGraphAddDependencies(graph, &a, &c, 1);     // A->C
cudaGraphAddDependencies(graph, &b, &d, 1);     // B->D
cudaGraphAddDependencies(graph, &c, &d, 1);     // C->D

3.2.8.7.3. Creating a Graph Using Stream Capture

Stream capture provides a mechanism to create a graph from existing stream-based APIs. A section of code which launches work into streams, including existing code, can be bracketed with calls to cudaStreamBeginCapture() and cudaStreamEndCapture(). See below.

cudaGraph_t graph;

cudaStreamBeginCapture(stream);

kernel_A<<< ..., stream >>>(...);
kernel_B<<< ..., stream >>>(...);
libraryCall(stream);
kernel_C<<< ..., stream >>>(...);

cudaStreamEndCapture(stream, &graph);

A call to cudaStreamBeginCapture() places a stream in capture mode. When a stream is being captured, work launched into the stream is not enqueued for execution. It is instead appended to an internal graph that is progressively being built up. This graph is then returned by calling cudaStreamEndCapture(), which also ends capture mode for the stream. A graph which is actively being constructed by stream capture is referred to as a capture graph.

Stream capture can be used on any CUDA stream except cudaStreamLegacy (the “NULL stream”). Note that it can be used on cudaStreamPerThread. If a program is using the legacy stream, it may be possible to redefine stream 0 to be the per-thread stream with no functional change. See Default Stream.

Whether a stream is being captured can be queried with cudaStreamIsCapturing().
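For example, code that may run both inside and outside of capture can guard operations that are illegal during capture with this query; a minimal sketch:

```cpp
// Check the capture status of `stream` before issuing work that is
// not legal while a capture is active (such as a synchronous copy).
cudaStreamCaptureStatus status;
cudaStreamIsCapturing(stream, &status);
if (status == cudaStreamCaptureStatusNone) {
    // Not capturing: synchronous APIs may be used here.
}
```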

Work can be captured to an existing graph using cudaStreamBeginCaptureToGraph(). Instead of capturing to an internal graph, work is captured to a graph provided by the user.

3.2.8.7.3.1. Cross-stream Dependencies and Events

Stream capture can handle cross-stream dependencies expressed with cudaEventRecord() and cudaStreamWaitEvent(), provided the event being waited upon was recorded into the same capture graph.

When an event is recorded in a stream that is in capture mode, it results in a captured event. A captured event represents a set of nodes in a capture graph.

When a captured event is waited on by a stream, it places the stream in capture mode if it is not already, and the next item in the stream will have additional dependencies on the nodes in the captured event. The two streams are then being captured to the same capture graph.

When cross-stream dependencies are present in stream capture, cudaStreamEndCapture() must still be called in the same stream where cudaStreamBeginCapture() was called; this is the origin stream. Any other streams which are being captured to the same capture graph, due to event-based dependencies, must also be joined back to the origin stream. This is illustrated below. All streams being captured to the same capture graph are taken out of capture mode upon cudaStreamEndCapture(). Failure to rejoin to the origin stream will result in failure of the overall capture operation.

// stream1 is the origin stream
cudaStreamBeginCapture(stream1);

kernel_A<<< ..., stream1 >>>(...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(...);
kernel_C<<< ..., stream2 >>>(...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

kernel_D<<< ..., stream1 >>>(...);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);

// stream1 and stream2 no longer in capture mode

The graph returned by the above code is shown in Figure 14.

Note

When a stream is taken out of capture mode, the next non-captured item in the stream (if any) will still have a dependency on the most recent prior non-captured item, despite intermediate items having been removed.

3.2.8.7.3.2. Prohibited and Unhandled Operations

It is invalid to synchronize or query the execution status of a stream which is being captured or a captured event, because they do not represent items scheduled for execution. It is also invalid to query the execution status of or synchronize a broader handle which encompasses an active stream capture, such as a device or context handle when any associated stream is in capture mode.

When any stream in the same context is being captured, and it was not created with cudaStreamNonBlocking, any attempted use of the legacy stream is invalid. This is because the legacy stream handle at all times encompasses these other streams; enqueueing to the legacy stream would create a dependency on the streams being captured, and querying it or synchronizing it would query or synchronize the streams being captured.

It is therefore also invalid to call synchronous APIs in this case. Synchronous APIs, such as cudaMemcpy(), enqueue work to the legacy stream and synchronize it before returning.

Note

As a general rule, when a dependency relation would connect something that is captured with something that was not captured and instead enqueued for execution, CUDA prefers to return an error rather than ignore the dependency. An exception is made for placing a stream into or out of capture mode; this severs a dependency relation between items added to the stream immediately before and after the mode transition.

It is invalid to merge two separate capture graphs by waiting on a captured event from a stream which is being captured and is associated with a different capture graph than the event. It is invalid to wait on a non-captured event from a stream which is being captured without specifying the cudaEventWaitExternal flag.

A small number of APIs that enqueue asynchronous operations into streams are not currently supported in graphs and will return an error if called with a stream which is being captured, such as cudaStreamAttachMemAsync().

3.2.8.7.3.3. Invalidation

When an invalid operation is attempted during stream capture, any associated capture graphs are invalidated. When a capture graph is invalidated, further use of any streams which are being captured or captured events associated with the graph is invalid and will return an error, until stream capture is ended with cudaStreamEndCapture(). This call will take the associated streams out of capture mode, but will also return an error value and a NULL graph.
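A sketch of handling this case, assuming stream was the origin stream of a capture that may have been invalidated:

```cpp
// Ending a capture that was invalidated still takes the stream out of
// capture mode, but reports an error and returns a NULL graph.
cudaGraph_t graph;
cudaError_t err = cudaStreamEndCapture(stream, &graph);
if (err != cudaSuccess || graph == NULL) {
    // The capture was invalidated: discard any partial state and
    // either retry the capture or report the error to the caller.
}
```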

3.2.8.7.4. CUDA User Objects

CUDA User Objects can be used to help manage the lifetime of resources used by asynchronous work in CUDA. In particular, this feature is useful for CUDA Graphs and stream capture.

Various resource management schemes are not compatible with CUDA graphs. Consider for example an event-based pool or a synchronous-create, asynchronous-destroy scheme.

// Library API with pool allocation
void libraryWork(cudaStream_t stream) {
    auto &resource = pool.claimTemporaryResource();
    resource.waitOnReadyEventInStream(stream);
    launchWork(stream, resource);
    resource.recordReadyEvent(stream);
}
// Library API with asynchronous resource deletion
void libraryWork(cudaStream_t stream) {
    Resource *resource = new Resource(...);
    launchWork(stream, resource);
    cudaStreamAddCallback(
        stream,
        [](cudaStream_t, cudaError_t, void *resource) {
            delete static_cast<Resource *>(resource);
        },
        resource,
        0);
    // Error handling considerations not shown
}

These schemes are difficult with CUDA graphs because of the non-fixed pointer or handle for the resource which requires indirection or graph update, and the synchronous CPU code needed each time the work is submitted. They also do not work with stream capture if these considerations are hidden from the caller of the library, and because of use of disallowed APIs during capture. Various solutions exist such as exposing the resource to the caller. CUDA user objects present another approach.

A CUDA user object associates a user-specified destructor callback with an internal refcount, similar to C++ shared_ptr. References may be owned by user code on the CPU and by CUDA graphs. Note that for user-owned references, unlike C++ smart pointers, there is no object representing the reference; users must track user-owned references manually. A typical use case would be to immediately move the sole user-owned reference to a CUDA graph after the user object is created.

When a reference is associated to a CUDA graph, CUDA will manage the graph operations automatically. A cloned cudaGraph_t retains a copy of every reference owned by the source cudaGraph_t, with the same multiplicity. An instantiated cudaGraphExec_t retains a copy of every reference in the source cudaGraph_t. When a cudaGraphExec_t is destroyed without being synchronized, the references are retained until the execution is completed.

Here is an example use.

cudaGraph_t graph;  // Preexisting graph

Object *object = new Object;  // C++ object with possibly nontrivial destructor
cudaUserObject_t cuObject;
cudaUserObjectCreate(
    &cuObject,
    object,  // Here we use a CUDA-provided template wrapper for this API,
             // which supplies a callback to delete the C++ object pointer
    1,  // Initial refcount
    cudaUserObjectNoDestructorSync  // Acknowledge that the callback cannot be
                                    // waited on via CUDA
);
cudaGraphRetainUserObject(
    graph,
    cuObject,
    1,  // Number of references
    cudaGraphUserObjectMove  // Transfer a reference owned by the caller (do
                             // not modify the total reference count)
);
// No more references owned by this thread; no need to call release API
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // Will retain a
                                                               // new reference
cudaGraphDestroy(graph);  // graphExec still owns a reference
cudaGraphLaunch(graphExec, 0);  // Async launch has access to the user objects
cudaGraphExecDestroy(graphExec);  // Launch is not synchronized; the release
                                  // will be deferred if needed
cudaStreamSynchronize(0);  // After the launch is synchronized, the remaining
                           // reference is released and the destructor will
                           // execute. Note this happens asynchronously.
// If the destructor callback had signaled a synchronization object, it would
// be safe to wait on it at this point.

References owned by graphs in child graph nodes are associated to the child graphs, not the parents. If a child graph is updated or deleted, the references change accordingly. If an executable graph or child graph is updated with cudaGraphExecUpdate or cudaGraphExecChildGraphNodeSetParams, the references in the new source graph are cloned and replace the references in the target graph. In either case, if previous launches are not synchronized, any references which would be released are held until the launches have finished executing.

There is not currently a mechanism to wait on user object destructors via a CUDA API. Users may signal a synchronization object manually from the destructor code. In addition, it is not legal to call CUDA APIs from the destructor, similar to the restriction on cudaLaunchHostFunc. This is to avoid blocking a CUDA internal shared thread and preventing forward progress. It is legal to signal another thread to perform an API call, if the dependency is one way and the thread doing the call cannot block forward progress of CUDA work.

User objects are created with cudaUserObjectCreate, which is a good starting point to browse related APIs.

3.2.8.7.5. Updating Instantiated Graphs

Work submission using graphs is separated into three distinct stages: definition, instantiation, and execution. In situations where the workflow is not changing, the overhead of definition and instantiation can be amortized over many executions, and graphs provide a clear advantage over streams.

A graph is a snapshot of a workflow, including kernels, parameters, and dependencies, in order to replay it as rapidly and efficiently as possible. In situations where the workflow changes the graph becomes out of date and must be modified. Major changes to graph structure such as topology or types of nodes will require re-instantiation of the source graph because various topology-related optimization techniques must be re-applied.

The cost of repeated instantiation can reduce the overall performance benefit from graph execution, but it is common for only node parameters, such as kernel parameters and cudaMemcpy addresses, to change while graph topology remains the same. For this case, CUDA provides a lightweight mechanism known as “Graph Update,” which allows certain node parameters to be modified in-place without having to rebuild the entire graph. This is much more efficient than re-instantiation.

Updates will take effect the next time the graph is launched, so they will not impact previous graph launches, even if they are running at the time of the update. A graph may be updated and relaunched repeatedly, so multiple updates/launches can be queued on a stream.

CUDA provides two mechanisms for updating instantiated graph parameters: whole graph update and individual node update. Whole graph update allows the user to supply a topologically identical cudaGraph_t object whose nodes contain updated parameters. Individual node update allows the user to explicitly update the parameters of individual nodes. Using an updated cudaGraph_t is more convenient when a large number of nodes are being updated, or when the graph topology is unknown to the caller (for example, when the graph resulted from stream capture of a library call). Using individual node update is preferred when the number of changes is small and the user has the handles to the nodes requiring updates. Individual node update skips the topology checks and comparisons for unchanged nodes, so it can be more efficient in many cases.

CUDA also provides a mechanism for enabling and disabling individual nodes without affecting their current parameters.

The following sections explain each approach in more detail.

3.2.8.7.5.1. Graph Update Limitations

Kernel nodes:

  • The owning context of the function cannot change.

  • A node whose function originally did not use CUDA dynamic parallelism cannot be updated to a function which uses CUDA dynamic parallelism.

cudaMemset and cudaMemcpy nodes:

  • The CUDA device(s) to which the operand(s) was allocated/mapped cannot change.

  • The source/destination memory must be allocated from the same context as the original source/destination memory.

  • Only 1D cudaMemset/cudaMemcpy nodes can be changed.

Additional memcpy node restrictions:

  • Changing either the source or destination memory type (i.e., cudaPitchedPtr, cudaArray_t, etc.), or the type of transfer (i.e., cudaMemcpyKind) is not supported.

External semaphore wait nodes and record nodes:

  • Changing the number of semaphores is not supported.

Conditional nodes:

  • The order of handle creation and assignment must match between the graphs.

  • Changing node parameters is not supported (e.g., the number of graphs in the conditional, the node context, etc.).

  • Changing parameters of nodes within the conditional body graph is subject to the rules above.

There are no restrictions on updates to host nodes, event record nodes, or event wait nodes.

3.2.8.7.5.2. Whole Graph Update

cudaGraphExecUpdate() allows an instantiated graph (the “original graph”) to be updated with the parameters from a topologically identical graph (the “updating” graph). The topology of the updating graph must be identical to the original graph used to instantiate the cudaGraph_Exec_t. In addition, the order in which the dependencies are specified must match. Finally, CUDA needs to consistently order the sink nodes (nodes with no outgoing edges). CUDA relies on the order of specific API calls to achieve consistent sink node ordering.

More explicitly, adhering to the following rules will cause cudaGraphExecUpdate() to pair the nodes in the original graph and the updating graph deterministically:

  1. For any capturing stream, the API calls operating on that stream must be made in the same order, including event waits and other API calls not directly corresponding to node creation.

  2. The API calls which directly manipulate a given graph node’s incoming edges (including captured stream APIs, node add APIs, and edge addition / removal APIs) must be made in the same order. Moreover, when dependencies are specified in arrays to these APIs, the order in which the dependencies are specified inside those arrays must match.

  3. Sink nodes must be consistently ordered. Sink nodes are nodes without dependent nodes / outgoing edges in the final graph at the time of the cudaGraphExecUpdate() invocation. The following operations affect sink node ordering (if present) and must (as a combined set) be made in the same order:

    • Node add APIs resulting in a sink node.

    • Edge removal resulting in a node becoming a sink node.

    • cudaStreamUpdateCaptureDependencies(), if it removes a sink node from a capturing stream’s dependency set.

    • cudaStreamEndCapture().

The following example shows how the API could be used to update an instantiated graph:

cudaGraphExec_t graphExec = NULL;

for (int i = 0; i < 10; i++) {
    cudaGraph_t graph;
    cudaGraphExecUpdateResult updateResult;
    cudaGraphNode_t errorNode;

    // In this example we use stream capture to create the graph.
    // You can also use the Graph API to produce a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    // Call a user-defined, stream based workload, for example
    do_cuda_work(stream);

    cudaStreamEndCapture(stream, &graph);

    // If we've already instantiated the graph, try to update it directly
    // and avoid the instantiation overhead
    if (graphExec != NULL) {
        // If the graph fails to update, errorNode will be set to the
        // node causing the failure and updateResult will be set to a
        // reason code.
        cudaGraphExecUpdate(graphExec, graph, &errorNode, &updateResult);
    }

    // Instantiate during the first iteration or whenever the update
    // fails for any reason
    if (graphExec == NULL || updateResult != cudaGraphExecUpdateSuccess) {

        // If a previous update failed, destroy the cudaGraphExec_t
        // before re-instantiating it
        if (graphExec != NULL) {
            cudaGraphExecDestroy(graphExec);
        }
        // Instantiate graphExec from graph. The error node and
        // error message parameters are unused here.
        cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    }

    cudaGraphDestroy(graph);
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
}

A typical workflow is to create the initial cudaGraph_t using either the stream capture or graph API. The cudaGraph_t is then instantiated and launched as normal. After the initial launch, a new cudaGraph_t is created using the same method as the initial graph and cudaGraphExecUpdate() is called. If the graph update is successful, indicated by the updateResult parameter in the above example, the updated cudaGraphExec_t is launched. If the update fails for any reason, cudaGraphExecDestroy() and cudaGraphInstantiate() are called to destroy the original cudaGraphExec_t and instantiate a new one.

It is also possible to update the cudaGraph_t nodes directly (e.g., using cudaGraphKernelNodeSetParams()) and subsequently update the cudaGraphExec_t; however, it is more efficient to use the explicit node update APIs covered in the next section.

Conditional handle flags and default values are updated as part of the graph update.

Please see the Graph API for more information on usage and current limitations.

3.2.8.7.5.3. Individual node update

Instantiated graph node parameters can be updated directly. This eliminates the overhead of instantiation as well as the overhead of creating a new cudaGraph_t. If the number of nodes requiring update is small relative to the total number of nodes in the graph, it is better to update the nodes individually. The following methods are available for updating cudaGraphExec_t nodes:

  • cudaGraphExecKernelNodeSetParams()

  • cudaGraphExecMemcpyNodeSetParams()

  • cudaGraphExecMemsetNodeSetParams()

  • cudaGraphExecHostNodeSetParams()

  • cudaGraphExecChildGraphNodeSetParams()

  • cudaGraphExecEventRecordNodeSetEvent()

  • cudaGraphExecEventWaitNodeSetEvent()

  • cudaGraphExecExternalSemaphoresSignalNodeSetParams()

  • cudaGraphExecExternalSemaphoresWaitNodeSetParams()
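As a sketch of the individual node update flow (the handles kernelNode, graphExec, stream, and the newKernelArgs array are hypothetical and assumed to come from earlier graph construction and instantiation), a kernel node's parameters might be updated in-place like this:

```
// Hypothetical handles from earlier setup: kernelNode was added to the source
// graph with cudaGraphAddKernelNode(), and graphExec was produced from that
// graph by cudaGraphInstantiate().
cudaKernelNodeParams nodeParams;
cudaGraphKernelNodeGetParams(kernelNode, &nodeParams); // read current parameters

nodeParams.kernelParams = newKernelArgs;               // swap in updated arguments

// Apply the new parameters to the instantiated graph in-place;
// the topology is untouched and no re-instantiation occurs.
cudaGraphExecKernelNodeSetParams(graphExec, kernelNode, &nodeParams);

cudaGraphLaunch(graphExec, stream); // the next launch uses the updated arguments
```

Note that the update targets the cudaGraphExec_t directly; the source cudaGraph_t is used only to identify the node.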

Please see the Graph API for more information on usage and current limitations.

3.2.8.7.5.4. Individual node enable

Kernel, memset and memcpy nodes in an instantiated graph can be enabled or disabled using the cudaGraphNodeSetEnabled() API. This allows the creation of a graph which contains a superset of the desired functionality which can be customized for each launch. The enable state of a node can be queried using the cudaGraphNodeGetEnabled() API.

A disabled node is functionally equivalent to an empty node until it is reenabled. Node parameters are not affected by enabling/disabling a node. Enable state is unaffected by individual node update or by whole graph update with cudaGraphExecUpdate(). Parameter updates made while the node is disabled will take effect when the node is reenabled.

The following methods are available for enabling/disabling cudaGraphExec_t nodes, as well as querying their status:

  • cudaGraphNodeSetEnabled()

  • cudaGraphNodeGetEnabled()
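For illustration, a minimal sketch of toggling a node between launches (the handles graphExec, node, and stream are hypothetical, assumed to come from earlier graph construction and instantiation):

```
// Disable the node: on subsequent launches it behaves like an empty node.
cudaGraphNodeSetEnabled(graphExec, node, 0);
cudaGraphLaunch(graphExec, stream); // runs without the node's work

// Query the enable state (0 here, since the node is disabled).
unsigned int isEnabled;
cudaGraphNodeGetEnabled(graphExec, node, &isEnabled);

// Re-enable the node; its parameters are unchanged by the disable/enable cycle.
cudaGraphNodeSetEnabled(graphExec, node, 1);
cudaGraphLaunch(graphExec, stream);
```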

Please see the Graph API for more information on usage and current limitations.

3.2.8.7.6. Using Graph APIs

cudaGraph_t objects are not thread-safe. It is the responsibility of the user to ensure that multiple threads do not concurrently access the same cudaGraph_t.

A cudaGraphExec_t cannot run concurrently with itself. A launch of a cudaGraphExec_t will be ordered after previous launches of the same executable graph.

Graph execution is done in streams for ordering with other asynchronous work. However, the stream is for ordering only; it does not constrain the internal parallelism of the graph, nor does it affect where graph nodes execute.

See Graph API.

3.2.8.7.7. Device Graph Launch

There are many workflows which need to make data-dependent decisions during runtime and execute different operations depending on those decisions. Rather than offloading this decision-making process to the host, which may require a round-trip from the device, users may prefer to perform it on the device. To that end, CUDA provides a mechanism to launch graphs from the device.

Device graph launch provides a convenient way to perform dynamic control flow from the device, be it something as simple as a loop or as complex as a device-side work scheduler. This functionality is only available on systems which support unified addressing.

Graphs which can be launched from the device will henceforth be referred to as device graphs, and graphs which cannot be launched from the device will be referred to as host graphs.

Device graphs can be launched from both the host and device, whereas host graphs can only be launched from the host. Unlike host launches, launching a device graph from the device while a previous launch of the graph is running will result in an error, returning cudaErrorInvalidValue; therefore, a device graph cannot be launched twice from the device at the same time. Launching a device graph from the host and device simultaneously will result in undefined behavior.

3.2.8.7.7.1. Device Graph Creation

In order for a graph to be launched from the device, it must be instantiated explicitly for device launch. This is achieved by passing the cudaGraphInstantiateFlagDeviceLaunch flag to the cudaGraphInstantiate() call. As is the case for host graphs, device graph structure is fixed at time of instantiation and cannot be updated without re-instantiation, and instantiation can only be performed on the host. In order for a graph to be able to be instantiated for device launch, it must adhere to various requirements.

3.2.8.7.7.1.1. Device Graph Requirements

General requirements: 通用要求:

  • The graph’s nodes must all reside on a single device.

  • The graph can only contain kernel nodes, memcpy nodes, memset nodes, and child graph nodes.

Kernel nodes: 内核节点:

  • Use of CUDA Dynamic Parallelism by kernels in the graph is not permitted.

  • Cooperative launches are permitted so long as MPS is not in use.

Memcpy nodes: 复制节点:

  • Only copies involving device memory and/or pinned device-mapped host memory are permitted.

  • Copies involving CUDA arrays are not permitted.

  • Both operands must be accessible from the current device at time of instantiation. Note that the copy operation will be performed from the device on which the graph resides, even if it is targeting memory on another device.

3.2.8.7.7.1.2. Device Graph Upload

In order to launch a graph on the device, it must first be uploaded to the device to populate the necessary device resources. This can be achieved in one of two ways.

Firstly, the graph can be uploaded explicitly, either via cudaGraphUpload() or by requesting an upload as part of instantiation via cudaGraphInstantiateWithParams().

Alternatively, the graph can first be launched from the host, which will perform this upload step implicitly as part of the launch.

Examples of all three methods can be seen below:

// Explicit upload after instantiation
cudaGraphInstantiate(&deviceGraphExec1, deviceGraph1, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(deviceGraphExec1, stream);

// Explicit upload as part of instantiation
cudaGraphInstantiateParams instantiateParams = {0};
instantiateParams.flags = cudaGraphInstantiateFlagDeviceLaunch | cudaGraphInstantiateFlagUpload;
instantiateParams.uploadStream = stream;
cudaGraphInstantiateWithParams(&deviceGraphExec2, deviceGraph2, &instantiateParams);

// Implicit upload via host launch
cudaGraphInstantiate(&deviceGraphExec3, deviceGraph3, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphLaunch(deviceGraphExec3, stream);
3.2.8.7.7.1.3. Device Graph Update

Device graphs can only be updated from the host, and must be re-uploaded to the device upon executable graph update in order for the changes to take effect. This can be achieved using the same methods outlined in the previous section. Unlike host graphs, launching a device graph from the device while an update is being applied will result in undefined behavior.

3.2.8.7.7.2. Device Launch

Device graphs can be launched from both the host and the device via cudaGraphLaunch(), which has the same signature on the device as on the host. Device graphs are launched via the same handle on the host and the device. Device graphs must be launched from another graph when launched from the device.

Device-side graph launch is per-thread and multiple launches may occur from different threads at the same time, so the user will need to select a single thread from which to launch a given graph.

3.2.8.7.7.2.1. Device Launch Modes

Unlike host launch, device graphs cannot be launched into regular CUDA streams, and can only be launched into distinct named streams, which each denote a specific launch mode:

Table 2 Device-only Graph Launch Streams

Stream                                    Launch Mode
cudaStreamGraphFireAndForget              Fire and forget launch
cudaStreamGraphTailLaunch                 Tail launch
cudaStreamGraphFireAndForgetAsSibling     Sibling launch

3.2.8.7.7.2.1.1. Fire and Forget Launch

As the name suggests, a fire and forget launch is submitted to the GPU immediately, and it runs independently of the launching graph. In a fire-and-forget scenario, the launching graph is the parent, and the launched graph is the child.

_images/fire-and-forget-simple.png

Figure 15 Fire and forget launch

The above diagram can be generated by the sample code below:

__global__ void launchFireAndForgetGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForget);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchFireAndForgetGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}

A graph can have up to 120 total fire-and-forget graphs during the course of its execution. This total resets between launches of the same parent graph.

3.2.8.7.7.2.1.2. Graph Execution Environments

In order to fully understand the device-side synchronization model, it is first necessary to understand the concept of an execution environment.

When a graph is launched from the device, it is launched into its own execution environment. The execution environment of a given graph encapsulates all work in the graph as well as all generated fire and forget work. The graph can be considered complete when it has completed execution and when all generated child work is complete.

The below diagram shows the environment encapsulation that would be generated by the fire-and-forget sample code in the previous section.

_images/fire-and-forget-environments.png

Figure 16 Fire and forget launch, with execution environments

These environments are also hierarchical, so a graph environment can include multiple levels of child-environments from fire and forget launches.

_images/fire-and-forget-nested-environments.png

Figure 17 Nested fire and forget environments

When a graph is launched from the host, there exists a stream environment that parents the execution environment of the launched graph. The stream environment encapsulates all work generated as part of the overall launch. The stream launch is complete (i.e. downstream dependent work may now run) when the overall stream environment is marked as complete.

_images/device-graph-stream-environment.png

Figure 18 The stream environment, visualized

3.2.8.7.7.2.1.3. Tail Launch

Unlike on the host, it is not possible to synchronize with device graphs from the GPU via traditional methods such as cudaDeviceSynchronize() or cudaStreamSynchronize(). Rather, in order to enable serial work dependencies, a different launch mode - tail launch - is offered, to provide similar functionality.

A tail launch executes when a graph’s environment is considered complete, i.e., when the graph and all its children are complete. When a graph completes, the environment of the next graph in the tail launch list will replace the completed environment as a child of the parent environment. Like fire-and-forget launches, a graph can have multiple graphs enqueued for tail launch.

_images/tail-launch-simple.png

Figure 19 A simple tail launch

The above execution flow can be generated by the code below:

__global__ void launchTailGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphTailLaunch);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchTailGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}

Tail launches enqueued by a given graph will execute one at a time, in order of when they were enqueued. So the first enqueued graph will run first, and then the second, and so on.

_images/tail-launch-ordering-simple.png

Figure 20 Tail launch ordering
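For example, a kernel inside a device graph might enqueue two tail launches; under the ordering rule above, gExecA's environment completes before gExecB begins. (The handles are hypothetical device graphs assumed to have been instantiated for device launch and uploaded by the host.)

```
__global__ void enqueueTailGraphs(cudaGraphExec_t gExecA, cudaGraphExec_t gExecB) {
    if (threadIdx.x == 0) {
        // Enqueued first, so gExecA (and all of its children) runs to completion first.
        cudaGraphLaunch(gExecA, cudaStreamGraphTailLaunch);
        // gExecB executes only once gExecA's environment is complete.
        cudaGraphLaunch(gExecB, cudaStreamGraphTailLaunch);
    }
}
```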

Tail launches enqueued by a tail graph will execute before tail launches enqueued by previous graphs in the tail launch list. These new tail launches will execute in the order they are enqueued.

_images/tail-launch-ordering-complex.png

Figure 21 Tail launch ordering when enqueued from multiple graphs

A graph can have up to 255 pending tail launches.

3.2.8.7.7.2.1.3.1. Tail Self-launch

It is possible for a device graph to enqueue itself for a tail launch, although a given graph can only have one self-launch enqueued at a time. In order to query the currently running device graph so that it can be relaunched, a new device-side function is added:

cudaGraphExec_t cudaGetCurrentGraphExec();

This function returns the handle of the currently running graph if it is a device graph. If the currently executing kernel is not a node within a device graph, this function will return NULL.

Below is sample code showing usage of this function for a relaunch loop:

__device__ int relaunchCount = 0;

__global__ void relaunchSelf() {
    int relaunchMax = 100;

    if (threadIdx.x == 0) {
        if (relaunchCount < relaunchMax) {
            cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
        }

        relaunchCount++;
    }
}
3.2.8.7.7.2.1.4. Sibling Launch

Sibling launch is a variation of fire-and-forget launch in which the graph is launched not as a child of the launching graph’s execution environment, but rather as a child of the launching graph’s parent environment. Sibling launch is equivalent to a fire-and-forget launch from the launching graph’s parent environment.

_images/sibling-launch-simple.png

Figure 22 A simple sibling launch

The above diagram can be generated by the sample code below:

__global__ void launchSiblingGraph(cudaGraphExec_t graph) {
    cudaGraphLaunch(graph, cudaStreamGraphFireAndForgetAsSibling);
}

void graphSetup() {
    cudaGraphExec_t gExec1, gExec2;
    cudaGraph_t g1, g2;

    // Create, instantiate, and upload the device graph.
    create_graph(&g2);
    cudaGraphInstantiate(&gExec2, g2, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(gExec2, stream);

    // Create and instantiate the launching graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    launchSiblingGraph<<<1, 1, 0, stream>>>(gExec2);
    cudaStreamEndCapture(stream, &g1);
    cudaGraphInstantiate(&gExec1, g1);

    // Launch the host graph, which will in turn launch the device graph.
    cudaGraphLaunch(gExec1, stream);
}

Since sibling launches are not launched into the launching graph’s execution environment, they will not gate tail launches enqueued by the launching graph.

3.2.8.7.8. Conditional Graph Nodes

Conditional nodes allow conditional execution and looping of a graph contained within the conditional node. This allows dynamic and iterative workflows to be represented completely within a graph and frees up the host CPU to perform other work in parallel.

Evaluation of the condition value is performed on the device when the dependencies of the conditional node have been met. Conditional nodes can be one of the following types:

  • Conditional IF nodes execute their body graph once if the condition value is non-zero when the node is executed.

  • Conditional WHILE nodes execute their body graph if the condition value is non-zero when the node is executed and will continue to execute their body graph until the condition value is zero.

A condition value is accessed through a conditional handle, which must be created before the node. The condition value can be set by device code using cudaGraphSetConditional(). A default value, applied on each graph launch, can also be specified when the handle is created.

When the conditional node is created, an empty graph is created and its handle is returned to the user so that the graph can be populated. This conditional body graph can be populated using either the graph APIs or cudaStreamBeginCaptureToGraph().

Conditional nodes can be nested.

3.2.8.7.8.1. Conditional Handles

A condition value is represented by cudaGraphConditionalHandle and is created by cudaGraphConditionalHandleCreate().

The handle must be associated with a single conditional node. Handles cannot be destroyed.

If cudaGraphCondAssignDefault is specified when the handle is created, the condition value will be initialized to the specified default before every graph launch. If this flag is not provided, it is up to the user to initialize the condition value in a kernel upstream of the conditional node which tests it. If the condition value is not initialized by one of these methods, its value is undefined.

The default value and flags associated with a handle will be updated during whole graph update.

3.2.8.7.8.2. Conditional Node Body Graph Requirements

General requirements:

  • The graph’s nodes must all reside on a single device.

  • The graph can only contain kernel nodes, empty nodes, memcpy nodes, memset nodes, child graph nodes, and conditional nodes.

Kernel nodes:

  • Use of CUDA Dynamic Parallelism by kernels in the graph is not permitted.

  • Cooperative launches are permitted so long as MPS is not in use.

Memcpy/Memset nodes:

  • Only copies/memsets involving device memory and/or pinned device-mapped host memory are permitted.

  • Copies/memsets involving CUDA arrays are not permitted.

  • Both operands must be accessible from the current device at time of instantiation. Note that the copy operation will be performed from the device on which the graph resides, even if it is targeting memory on another device.

3.2.8.7.8.3. Conditional IF Nodes

The body graph of an IF node will be executed once if the condition is non-zero when the node is executed. The following diagram depicts a 3-node graph where the middle node, B, is a conditional node:

_images/conditional-if-node.png

Figure 23 Conditional IF Node

The following code illustrates the creation of a graph containing an IF conditional node. The condition value is set using an upstream kernel. The body of the conditional is populated using the graph API.

__global__ void setHandle(cudaGraphConditionalHandle handle)
{
    ...
    cudaGraphSetConditional(handle, value);
    ...
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[1];
    int value = 1;

    cudaGraphCreate(&graph, 0);

    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)setHandle;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    cudaGraphAddNode(&node, graph, NULL, 0, &params);

    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body of the conditional node
    ...
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}
3.2.8.7.8.4. Conditional WHILE Nodes

The body graph of a WHILE node will be executed while the condition is non-zero. The condition will be evaluated when the node is executed and after completion of the body graph. The following diagram depicts a 3 node graph where the middle node, B, is a conditional node:

_images/conditional-while-node.png

Figure 24 Conditional WHILE Node

The following code illustrates the creation of a graph containing a WHILE conditional node. The handle is created using cudaGraphCondAssignDefault to avoid the need for an upstream kernel. The body of the conditional is populated using the graph API.

__global__ void loopKernel(cudaGraphConditionalHandle handle)
{
    static int count = 10;
    cudaGraphSetConditional(handle, --count ? 1 : 0);
}

void graphSetup() {
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaGraphNode_t node;
    void *kernelArgs[1];

    cudaGraphCreate(&graph, 0);

    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeWhile;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&node, graph, NULL, 0, &cParams);

    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    cudaGraphNodeParams params = { cudaGraphNodeTypeKernel };
    params.kernel.func = (void *)loopKernel;
    params.kernel.gridDim.x = params.kernel.gridDim.y = params.kernel.gridDim.z = 1;
    params.kernel.blockDim.x = params.kernel.blockDim.y = params.kernel.blockDim.z = 1;
    params.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    cudaGraphAddNode(&node, bodyGraph, NULL, 0, &params);

    cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
    cudaGraphLaunch(graphExec, 0);
    cudaDeviceSynchronize();

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
}

3.2.8.8. Events

The runtime also provides a way to closely monitor the device’s progress, as well as perform accurate timing, by letting the application asynchronously record events at any point in the program, and query when these events are completed. An event has completed when all tasks - or optionally, all commands in a given stream - preceding the event have completed. Events in stream zero are completed after all preceding tasks and commands in all streams are completed.

3.2.8.8.1. Creation and Destruction of Events

The following code sample creates two events:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

They are destroyed this way:

cudaEventDestroy(start);
cudaEventDestroy(stop);
3.2.8.8.2. Elapsed Time

The events created in Creation and Destruction of Events can be used to time the code sample of Creation and Destruction of Streams in the following way:

cudaEventRecord(start, 0);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDev + i * size, inputHost + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>
               (outputDev + i * size, inputDev + i * size, size);
    cudaMemcpyAsync(outputHost + i * size, outputDev + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

3.2.8.9. Synchronous Calls

When a synchronous function is called, control is not returned to the host thread before the device has completed the requested task. Whether the host thread will then yield, block, or spin can be specified by calling cudaSetDeviceFlags() with some specific flags (see reference manual for details) before any other CUDA call is performed by the host thread.
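For illustration, a minimal sketch (assuming the standard scheduling flags from the runtime API) that makes the host thread block on synchronization instead of spinning:

```cuda
#include <cuda_runtime.h>

int main()
{
    // Must be called before any other CUDA call performed by this host
    // thread. cudaDeviceScheduleBlockingSync blocks the host thread on a
    // synchronization primitive instead of spinning, trading wake-up
    // latency for lower CPU usage. Alternatives are cudaDeviceScheduleSpin,
    // cudaDeviceScheduleYield, and cudaDeviceScheduleAuto (the default).
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);

    // ... allocations and kernel launches ...

    cudaDeviceSynchronize(); // The host thread now blocks rather than spins.
    return 0;
}
```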

3.2.9. Multi-Device System

3.2.9.1. Device Enumeration

A host system can have multiple devices. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.

int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("Device %d has compute capability %d.%d.\n",
           device, deviceProp.major, deviceProp.minor);
}

3.2.9.2. Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device; streams and events are created in association with the currently set device. If no call to cudaSetDevice() is made, the current device is device 0.

The following code sample illustrates how setting the current device affects memory allocation and kernel execution.

size_t size = 1024 * sizeof(float);
cudaSetDevice(0);            // Set device 0 as current
float* p0;
cudaMalloc(&p0, size);       // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0); // Launch kernel on device 0
cudaSetDevice(1);            // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);       // Allocate memory on device 1
MyKernel<<<1000, 128>>>(p1); // Launch kernel on device 1

3.2.9.3. Stream and Event Behavior

A kernel launch will fail if it is issued to a stream that is not associated to the current device as illustrated in the following code sample.

cudaSetDevice(0);               // Set device 0 as current
cudaStream_t s0;
cudaStreamCreate(&s0);          // Create stream s0 on device 0
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 0 in s0
cudaSetDevice(1);               // Set device 1 as current
cudaStream_t s1;
cudaStreamCreate(&s1);          // Create stream s1 on device 1
MyKernel<<<100, 64, 0, s1>>>(); // Launch kernel on device 1 in s1

// This kernel launch will fail:
MyKernel<<<100, 64, 0, s0>>>(); // Launch kernel on device 1 in s0

A memory copy will succeed even if it is issued to a stream that is not associated to the current device.

cudaEventRecord() will fail if the input event and input stream are associated to different devices.

cudaEventElapsedTime() will fail if the two input events are associated to different devices.

cudaEventSynchronize() and cudaEventQuery() will succeed even if the input event is associated to a device that is different from the current device.

cudaStreamWaitEvent() will succeed even if the input stream and input event are associated to different devices. cudaStreamWaitEvent() can therefore be used to synchronize multiple devices with each other.

Each device has its own default stream (see Default Stream), so commands issued to the default stream of a device may execute out of order or concurrently with respect to commands issued to the default stream of any other device.
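Building on the code sample above, the following sketch uses cudaStreamWaitEvent() to make work issued to s1 on device 1 wait for work issued to s0 on device 0 (it assumes s0 and s1 were created as shown earlier):

```cuda
cudaEvent_t e0;
cudaSetDevice(0);
cudaEventCreate(&e0);            // e0 is associated with device 0
cudaEventRecord(e0, s0);         // Record after the work issued to s0

// Legal even though e0 and s1 are associated to different devices:
cudaStreamWaitEvent(s1, e0, 0);

cudaSetDevice(1);
MyKernel<<<100, 64, 0, s1>>>();  // Starts only once e0 has completed
cudaEventDestroy(e0);
```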

3.2.9.4. Peer-to-Peer Memory Access

Depending on the system properties, specifically the PCIe and/or NVLINK topology, devices are able to address each other’s memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device). This peer-to-peer memory access feature is supported between two devices if cudaDeviceCanAccessPeer() returns true for these two devices.

Peer-to-peer memory access is only supported in 64-bit applications and must be enabled between two devices by calling cudaDeviceEnablePeerAccess() as illustrated in the following code sample. On non-NVSwitch enabled systems, each device can support a system-wide maximum of eight peer connections.

A unified address space is used for both devices (see Unified Virtual Address Space), so the same pointer can be used to address memory from both devices as shown in the code sample below.

cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaDeviceEnablePeerAccess(0, 0);   // Enable peer-to-peer access
                                    // with device 0

// Launch kernel on device 1
// This kernel launch can access memory on device 0 at address p0
MyKernel<<<1000, 128>>>(p0);
3.2.9.4.1. IOMMU on Linux

On Linux only, CUDA and the display driver do not support IOMMU-enabled bare-metal PCIe peer-to-peer memory copy. However, CUDA and the display driver do support IOMMU via VM pass-through. As a consequence, users on Linux, when running on a native bare-metal system, should disable the IOMMU. The IOMMU should be enabled and the VFIO driver used as a PCIe pass-through for virtual machines.

On Windows the above limitation does not exist.

See also Allocating DMA Buffers on 64-bit Platforms.

3.2.9.5. Peer-to-Peer Memory Copy

Memory copies can be performed between the memories of two different devices.

When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory.

Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync() as illustrated in the following code sample.

cudaSetDevice(0);                   // Set device 0 as current
float* p0;
size_t size = 1024 * sizeof(float);
cudaMalloc(&p0, size);              // Allocate memory on device 0
cudaSetDevice(1);                   // Set device 1 as current
float* p1;
cudaMalloc(&p1, size);              // Allocate memory on device 1
cudaSetDevice(0);                   // Set device 0 as current
MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
cudaSetDevice(1);                   // Set device 1 as current
cudaMemcpyPeer(p1, 1, p0, 0, size); // Copy p0 to p1
MyKernel<<<1000, 128>>>(p1);        // Launch kernel on device 1

A copy (in the implicit NULL stream) between the memories of two different devices:

  • does not start until all commands previously issued to either device have completed and

  • runs to completion before any commands (see Asynchronous Concurrent Execution) issued after the copy to either device can start.

Consistent with the normal behavior of streams, an asynchronous copy between the memories of two devices may overlap with copies or kernels in another stream.

Note that if peer-to-peer access is enabled between two devices via cudaDeviceEnablePeerAccess() as described in Peer-to-Peer Memory Access, peer-to-peer memory copy between these two devices no longer needs to be staged through the host and is therefore faster.

3.2.10. Unified Virtual Address Space

When the application is run as a 64-bit process, a single address space is used for the host and all the devices of compute capability 2.0 and higher. All host memory allocations made via CUDA API calls and all device memory allocations on supported devices are within this virtual address range. As a consequence:

  • The location of any memory on the host allocated through CUDA, or on any of the devices which use the unified address space, can be determined from the value of the pointer using cudaPointerGetAttributes().

  • When copying to or from the memory of any device which uses the unified address space, the cudaMemcpyKind parameter of cudaMemcpy*() can be set to cudaMemcpyDefault to determine locations from the pointers. This also works for host pointers not allocated through CUDA, as long as the current device uses unified addressing.

  • Allocations via cudaHostAlloc() are automatically portable (see Portable Memory) across all the devices for which the unified address space is used, and pointers returned by cudaHostAlloc() can be used directly from within kernels running on these devices (i.e., there is no need to obtain a device pointer via cudaHostGetDevicePointer() as described in Mapped Memory).

Applications may query if the unified address space is used for a particular device by checking that the unifiedAddressing device property (see Device Enumeration) is equal to 1.
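As a short sketch of the first two points, cudaPointerGetAttributes() classifies a pointer at run time, and cudaMemcpyDefault lets the runtime infer the copy direction from the pointer values:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    float *devPtr;
    cudaMalloc(&devPtr, 1024 * sizeof(float));

    // Query where the pointer lives; attr.type is cudaMemoryTypeDevice,
    // cudaMemoryTypeHost, cudaMemoryTypeManaged, or cudaMemoryTypeUnregistered.
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, devPtr);
    if (attr.type == cudaMemoryTypeDevice)
        printf("Device memory on device %d\n", attr.device);

    // With unified addressing, the direction can be inferred from the pointers.
    float host[4] = {0.f, 1.f, 2.f, 3.f};
    cudaMemcpy(devPtr, host, sizeof(host), cudaMemcpyDefault);

    cudaFree(devPtr);
    return 0;
}
```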

3.2.11. Interprocess Communication

Any device memory pointer or event handle created by a host thread can be directly referenced by any other thread within the same process. It is not valid outside this process however, and therefore cannot be directly referenced by threads belonging to a different process.

To share device memory pointers and events across processes, an application must use the Inter Process Communication API, which is described in detail in the reference manual. The IPC API is only supported for 64-bit processes on Linux and for devices of compute capability 2.0 and higher. Note that the IPC API is not supported for cudaMallocManaged allocations.

Using this API, an application can get the IPC handle for a given device memory pointer using cudaIpcGetMemHandle(), pass it to another process using standard IPC mechanisms (for example, interprocess shared memory or files), and use cudaIpcOpenMemHandle() to retrieve a device pointer from the IPC handle that is a valid pointer within this other process. Event handles can be shared using similar entry points.

Note that allocations made by cudaMalloc() may be sub-allocated from a larger block of memory for performance reasons. In such case, CUDA IPC APIs will share the entire underlying memory block which may cause other sub-allocations to be shared, which can potentially lead to information disclosure between processes. To prevent this behavior, it is recommended to only share allocations with a 2MiB aligned size.
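A minimal sketch of the exporting side (the file used to pass the handle bytes and the allocation size are illustrative choices, not part of the API):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    float *devPtr;
    cudaMalloc(&devPtr, 2 * 1024 * 1024);   // 2 MiB aligned size, as recommended

    // Export an IPC handle for the allocation.
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devPtr);

    // Pass the handle bytes to the other process by any IPC mechanism;
    // a file is used here purely for illustration.
    FILE *f = fopen("ipc_handle.bin", "wb");
    fwrite(&handle, sizeof(handle), 1, f);
    fclose(f);

    // The importing process would read the handle back and call
    // cudaIpcOpenMemHandle(&ptr, handle, cudaIpcMemLazyEnablePeerAccess),
    // then cudaIpcCloseMemHandle(ptr) when done. The exporter must keep
    // the allocation alive while peers are using it.
    return 0;
}
```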

An example of using the IPC API is where a single primary process generates a batch of input data, making the data available to multiple secondary processes without requiring regeneration or copying.

Applications using CUDA IPC to communicate with each other should be compiled, linked, and run with the same CUDA driver and runtime.

Note

Since CUDA 11.5, only events-sharing IPC APIs are supported on L4T and embedded Linux Tegra devices with compute capability 7.x and higher. The memory-sharing IPC APIs are still not supported on Tegra platforms.

3.2.12. Error Checking

All runtime functions return an error code, but for an asynchronous function (see Asynchronous Concurrent Execution), this error code cannot possibly report any of the asynchronous errors that could occur on the device since the function returns before the device has completed the task; the error code only reports errors that occur on the host prior to executing the task, typically related to parameter validation; if an asynchronous error occurs, it will be reported by some subsequent unrelated runtime function call.

The only way to check for asynchronous errors just after some asynchronous function call is therefore to synchronize just after the call by calling cudaDeviceSynchronize() (or by using any other synchronization mechanisms described in Asynchronous Concurrent Execution) and checking the error code returned by cudaDeviceSynchronize().

The runtime maintains an error variable for each host thread that is initialized to cudaSuccess and is overwritten by the error code every time an error occurs (be it a parameter validation error or an asynchronous error). cudaPeekAtLastError() returns this variable. cudaGetLastError() returns this variable and resets it to cudaSuccess.

Kernel launches do not return any error code, so cudaPeekAtLastError() or cudaGetLastError() must be called just after the kernel launch to retrieve any pre-launch errors. To ensure that any error returned by cudaPeekAtLastError() or cudaGetLastError() does not originate from calls prior to the kernel launch, one has to make sure that the runtime error variable is set to cudaSuccess just before the kernel launch, for example, by calling cudaGetLastError() just before the kernel launch. Kernel launches are asynchronous, so to check for asynchronous errors, the application must synchronize in-between the kernel launch and the call to cudaPeekAtLastError() or cudaGetLastError().
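Putting these rules together, a typical error-checking pattern around a kernel launch might look like this sketch (MyKernel and its launch configuration are placeholders):

```cuda
// Reset the per-thread error variable so the checks below cannot pick up
// a stale error from earlier calls.
cudaGetLastError();

MyKernel<<<gridDim, blockDim>>>(/* args */);

// Pre-launch errors, e.g. an invalid execution configuration.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(err));

// Synchronize to surface asynchronous errors from the kernel itself.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Execution error: %s\n", cudaGetErrorString(err));
```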

Note that cudaErrorNotReady that may be returned by cudaStreamQuery() and cudaEventQuery() is not considered an error and is therefore not reported by cudaPeekAtLastError() or cudaGetLastError().

3.2.13. Call Stack

On devices of compute capability 2.x and higher, the size of the call stack can be queried using cudaDeviceGetLimit() and set using cudaDeviceSetLimit().

When the call stack overflows, the kernel call fails with a stack overflow error if the application is run via a CUDA debugger (CUDA-GDB, Nsight), or with an unspecified launch error otherwise. When the compiler cannot determine the stack size, it issues a warning saying Stack size cannot be statically determined. This is usually the case with recursive functions. Once this warning is issued, the user will need to set the stack size manually if the default stack size is not sufficient.
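For example (the 8 KiB value below is illustrative; the size actually required depends on the kernel):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    size_t stackSize;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("Per-thread stack size: %zu bytes\n", stackSize);

    // Raise the limit, e.g. for a deeply recursive kernel.
    cudaDeviceSetLimit(cudaLimitStackSize, 8 * 1024);
    return 0;
}
```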

3.2.14. Texture and Surface Memory

CUDA supports a subset of the texturing hardware that the GPU uses for graphics to access texture and surface memory. Reading data from texture or surface memory instead of global memory can have several performance benefits as described in Device Memory Accesses.

3.2.14.1. Texture Memory

Texture memory is read from kernels using the device functions described in Texture Functions. The process of reading a texture calling one of these functions is called a texture fetch. Each texture fetch specifies a parameter called a texture object for the texture object API.

The texture object specifies:

  • The texture, which is the piece of texture memory that is fetched. Texture objects are created at runtime and the texture is specified when creating the texture object as described in Texture Object API.

  • Its dimensionality that specifies whether the texture is addressed as a one dimensional array using one texture coordinate, a two-dimensional array using two texture coordinates, or a three-dimensional array using three texture coordinates. Elements of the array are called texels, short for texture elements. The texture width, height, and depth refer to the size of the array in each dimension. Table 21 lists the maximum texture width, height, and depth depending on the compute capability of the device.

  • The type of a texel, which is restricted to the basic integer and single-precision floating-point types and any of the 1-, 2-, and 4-component vector types defined in Built-in Vector Types that are derived from the basic integer and single-precision floating-point types.

  • The read mode, which is equal to cudaReadModeNormalizedFloat or cudaReadModeElementType. If it is cudaReadModeNormalizedFloat and the type of the texel is a 16-bit or 8-bit integer type, the value returned by the texture fetch is actually returned as floating-point type and the full range of the integer type is mapped to [0.0, 1.0] for unsigned integer type and [-1.0, 1.0] for signed integer type; for example, an unsigned 8-bit texture element with the value 0xff reads as 1. If it is cudaReadModeElementType, no conversion is performed.

  • Whether texture coordinates are normalized or not. By default, textures are referenced (by the functions of Texture Functions) using floating-point coordinates in the range [0, N-1] where N is the size of the texture in the dimension corresponding to the coordinate. For example, a texture that is 64x32 in size will be referenced with coordinates in the range [0, 63] and [0, 31] for the x and y dimensions, respectively. Normalized texture coordinates cause the coordinates to be specified in the range [0.0, 1.0-1/N] instead of [0, N-1], so the same 64x32 texture would be addressed by normalized coordinates in the range [0, 1-1/N] in both the x and y dimensions. Normalized texture coordinates are a natural fit to some applications’ requirements, if it is preferable for the texture coordinates to be independent of the texture size.

  • The addressing mode. It is valid to call the device functions of Section B.8 with coordinates that are out of range. The addressing mode defines what happens in that case. The default addressing mode is to clamp the coordinates to the valid range: [0, N) for non-normalized coordinates and [0.0, 1.0) for normalized coordinates. If the border mode is specified instead, texture fetches with out-of-range texture coordinates return zero. For normalized coordinates, the wrap mode and the mirror mode are also available. When using the wrap mode, each coordinate x is converted to frac(x)=x - floor(x) where floor(x) is the largest integer not greater than x. When using the mirror mode, each coordinate x is converted to frac(x) if floor(x) is even and 1-frac(x) if floor(x) is odd. The addressing mode is specified as an array of size three whose first, second, and third elements specify the addressing mode for the first, second, and third texture coordinates, respectively; the addressing mode are cudaAddressModeBorder, cudaAddressModeClamp, cudaAddressModeWrap, and cudaAddressModeMirror; cudaAddressModeWrap and cudaAddressModeMirror are only supported for normalized texture coordinates

  • The filtering mode which specifies how the value returned when fetching the texture is computed based on the input texture coordinates. Linear texture filtering may be done only for textures that are configured to return floating-point data. It performs low-precision interpolation between neighboring texels. When enabled, the texels surrounding a texture fetch location are read and the return value of the texture fetch is interpolated based on where the texture coordinates fell between the texels. Simple linear interpolation is performed for one-dimensional textures, bilinear interpolation for two-dimensional textures, and trilinear interpolation for three-dimensional textures. Texture Fetching gives more details on texture fetching. The filtering mode is equal to cudaFilterModePoint or cudaFilterModeLinear. If it is cudaFilterModePoint, the returned value is the texel whose texture coordinates are the closest to the input texture coordinates. If it is cudaFilterModeLinear, the returned value is the linear interpolation of the two (for a one-dimensional texture), four (for a two dimensional texture), or eight (for a three dimensional texture) texels whose texture coordinates are the closest to the input texture coordinates. cudaFilterModeLinear is only valid for returned values of floating-point type.

Texture Object API introduces the texture object API.

16-Bit Floating-Point Textures explains how to deal with 16-bit floating-point textures.

Textures can also be layered as described in Layered Textures.

Cubemap Textures and Cubemap Layered Textures describe a special type of texture, the cubemap texture.

Texture Gather describes a special texture fetch, texture gather.

3.2.14.1.1. Texture Object API

A texture object is created using cudaCreateTextureObject() from a resource description of type struct cudaResourceDesc, which specifies the texture, and from a texture description defined as such:

struct cudaTextureDesc
{
    enum cudaTextureAddressMode addressMode[3];
    enum cudaTextureFilterMode  filterMode;
    enum cudaTextureReadMode    readMode;
    int                         sRGB;
    int                         normalizedCoords;
    unsigned int                maxAnisotropy;
    enum cudaTextureFilterMode  mipmapFilterMode;
    float                       mipmapLevelBias;
    float                       minMipmapLevelClamp;
    float                       maxMipmapLevelClamp;
};
  • addressMode specifies the addressing mode;

  • filterMode specifies the filter mode;

  • readMode specifies the read mode;

  • normalizedCoords specifies whether texture coordinates are normalized or not;

  • See reference manual for sRGB, maxAnisotropy, mipmapFilterMode, mipmapLevelBias, minMipmapLevelClamp, and maxMipmapLevelClamp.

The following code sample applies some simple transformation kernel to a texture.

// Simple transformation kernel
__global__ void transformKernel(float* output,
                                cudaTextureObject_t texObj,
                                int width, int height,
                                float theta)
{
    // Calculate normalized texture coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    float u = x / (float)width;
    float v = y / (float)height;

    // Transform coordinates
    u -= 0.5f;
    v -= 0.5f;
    float tu = u * cosf(theta) - v * sinf(theta) + 0.5f;
    float tv = v * cosf(theta) + u * sinf(theta) + 0.5f;

    // Read from texture and write to global memory
    output[y * width + x] = tex2D<float>(texObj, tu, tv);
}
// Host code
int main()
{
    const int height = 1024;
    const int width = 1024;
    float angle = 0.5;

    // Allocate and set some host data
    float *h_data = (float *)std::malloc(sizeof(float) * width * height);
    for (int i = 0; i < height * width; ++i)
        h_data[i] = i;

    // Allocate CUDA array in device memory
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);

    // Set pitch of the source (the width in memory in bytes of the 2D array pointed
    // to by src, including padding), we don't have any padding
    const size_t spitch = width * sizeof(float);
    // Copy data located at address h_data in host memory to device memory
    cudaMemcpy2DToArray(cuArray, 0, 0, h_data, spitch, width * sizeof(float),
                        height, cudaMemcpyHostToDevice);

    // Specify texture
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = cuArray;

    // Specify texture object parameters
    struct cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModeLinear;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;

    // Create texture object
    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject(&texObj, &resDesc, &texDesc, NULL);

    // Allocate result of transformation in device memory
    float *output;
    cudaMalloc(&output, width * height * sizeof(float));

    // Invoke kernel
    dim3 threadsperBlock(16, 16);
    dim3 numBlocks((width + threadsperBlock.x - 1) / threadsperBlock.x,
                    (height + threadsperBlock.y - 1) / threadsperBlock.y);
    transformKernel<<<numBlocks, threadsperBlock>>>(output, texObj, width, height,
                                                    angle);
    // Copy data from device back to host
    cudaMemcpy(h_data, output, width * height * sizeof(float),
                cudaMemcpyDeviceToHost);

    // Destroy texture object
    cudaDestroyTextureObject(texObj);

    // Free device memory
    cudaFreeArray(cuArray);
    cudaFree(output);

    // Free host memory
    free(h_data);

    return 0;
}
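The coordinate arithmetic in transformKernel is independent of the texture hardware and can be checked on the host: subtracting 0.5 before the rotation makes the texture rotate about its center. The following host-side sketch is illustrative only (rotate2D is not a CUDA function):

```cpp
#include <cassert>
#include <cmath>

// Host-side check of transformKernel's coordinate math: rotate normalized
// coordinates (u, v) by theta around the texture center (0.5, 0.5).
void rotate2D(float u, float v, float theta, float* tu, float* tv) {
    u -= 0.5f;
    v -= 0.5f;
    *tu = u * std::cos(theta) - v * std::sin(theta) + 0.5f;
    *tv = v * std::cos(theta) + u * std::sin(theta) + 0.5f;
}
```

With theta = 0 the transform is the identity, and the center (0.5, 0.5) maps to itself for any angle, confirming the rotation pivot.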
3.2.14.1.2. 16-Bit Floating-Point Textures
3.2.14.1.2. 16 位浮点纹理 

The 16-bit floating-point or half format supported by CUDA arrays is the same as the IEEE 754-2008 binary16 format.
CUDA 数组支持的 16 位浮点(半精度)格式与 IEEE 754-2008 binary16 格式相同。

CUDA C++ does not support a matching data type, but provides intrinsic functions to convert to and from the 32-bit floating-point format via the unsigned short type: __float2half_rn(float) and __half2float(unsigned short). These functions are only supported in device code. Equivalent functions for the host code can be found in the OpenEXR library, for example.
CUDA C++ 不支持对应的数据类型,但提供了通过 unsigned short 类型与 32 位浮点格式相互转换的内置函数: __float2half_rn(float)__half2float(unsigned short) 。这些函数仅在设备代码中受支持。例如,主机代码的等效函数可以在 OpenEXR 库中找到。
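Since __float2half_rn() and __half2float() exist only in device code, a host program needs its own conversion, via OpenEXR or a small routine. The following host-side sketch of the binary16 layout is illustrative only: it handles normal numbers and zero, flushes subnormals to zero, does not treat infinities or NaN specially, and truncates the mantissa instead of rounding to nearest even:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side binary16 <-> binary32 conversion sketch (normal numbers only).
// binary16 layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
unsigned short floatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t sign = (bits >> 16) & 0x8000u;                     // sign to bit 15
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15;  // rebias exponent
    uint32_t mant = (bits >> 13) & 0x3FFu;                      // top 10 mantissa bits
    if (exp <= 0)  return (unsigned short)sign;                 // underflow: signed zero
    if (exp >= 31) return (unsigned short)(sign | 0x7C00);      // overflow: infinity
    return (unsigned short)(sign | ((uint32_t)exp << 10) | mant);
}

float halfToFloat(unsigned short h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    int32_t  exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = (exp == 0)
        ? sign                                                  // zero (subnormals dropped)
        : sign | ((uint32_t)(exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

For example, 1.0f round-trips through the half encoding 0x3C00 (exponent 15, zero mantissa).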

16-bit floating-point components are promoted to 32 bit float during texture fetching before any filtering is performed.
在执行任何过滤之前,16 位浮点组件在纹理获取期间被提升为 32 位浮点。

A channel description for the 16-bit floating-point format can be created by calling one of the cudaCreateChannelDescHalf*() functions.
16 位浮点格式的通道描述可以通过调用 cudaCreateChannelDescHalf*() 函数之一来创建。

3.2.14.1.3. Layered Textures
3.2.14.1.3. 分层纹理 

A one-dimensional or two-dimensional layered texture (also known as texture array in Direct3D and array texture in OpenGL) is a texture made up of a sequence of layers, all of which are regular textures of same dimensionality, size, and data type.
一维或二维分层纹理(也称为 Direct3D 中的纹理数组和 OpenGL 中的数组纹理)是由一系列层组成的纹理,所有这些层都是相同维度、大小和数据类型的常规纹理。

A one-dimensional layered texture is addressed using an integer index and a floating-point texture coordinate; the index denotes a layer within the sequence and the coordinate addresses a texel within that layer. A two-dimensional layered texture is addressed using an integer index and two floating-point texture coordinates; the index denotes a layer within the sequence and the coordinates address a texel within that layer.
使用整数索引和浮点纹理坐标来处理一维分层纹理;索引表示序列中的一层,坐标表示该层中的纹素。使用整数索引和两个浮点纹理坐标来处理二维分层纹理;索引表示序列中的一层,坐标表示该层中的纹素。

A layered texture can only be a CUDA array by calling cudaMalloc3DArray() with the cudaArrayLayered flag (and a height of zero for one-dimensional layered texture).
分层纹理只能是通过调用 cudaMalloc3DArray() 并使用 cudaArrayLayered 标志创建的 CUDA 数组(对于一维分层纹理,高度设为零)。

Layered textures are fetched using the device functions described in tex1DLayered() and tex2DLayered(). Texture filtering (see Texture Fetching) is done only within a layer, not across layers.
分层纹理是使用 tex1DLayered()和 tex2DLayered()中描述的设备函数获取的。纹理过滤(请参阅纹理获取)仅在一个层内完成,而不是跨层。

Layered textures are only supported on devices of compute capability 2.0 and higher.
分层纹理仅支持计算能力为 2.0 及更高的设备。

3.2.14.1.4. Cubemap Textures
3.2.14.1.4. 立方体贴图 

A cubemap texture is a special type of two-dimensional layered texture that has six layers representing the faces of a cube:
一个立方贴图纹理是一种特殊类型的二维分层纹理,它有六个层,代表立方体的各个面:

  • The width of a layer is equal to its height.
    图层的宽度等于其高度。

  • The cubemap is addressed using three texture coordinates x, y, and z that are interpreted as a direction vector emanating from the center of the cube and pointing to one face of the cube and a texel within the layer corresponding to that face. More specifically, the face is selected by the coordinate with largest magnitude m and the corresponding layer is addressed using coordinates (s/m+1)/2 and (t/m+1)/2 where s and t are defined in Table 3.
    立方体贴图使用三个纹理坐标 x、y 和 z 进行寻址,这些坐标被解释为从立方体中心发出、指向立方体某个面以及该面对应层内某个纹素的方向向量。更具体地说,面由幅度最大的坐标 m 选择,相应的层使用坐标 (s/m+1)/2 和 (t/m+1)/2 寻址,其中 s 和 t 在表 3 中定义。

Table 3 Cubemap Fetch
表 3 立方体贴图获取 

                          face    m     s     t
|x| > |y| and |x| > |z|
    x ≥ 0                   0     x    -z    -y
    x < 0                   1    -x     z    -y
|y| > |x| and |y| > |z|
    y ≥ 0                   2     y     x     z
    y < 0                   3    -y     x    -z
|z| > |x| and |z| > |y|
    z ≥ 0                   4     z     x    -y
    z < 0                   5    -z    -x    -y

A cubemap texture can only be a CUDA array by calling cudaMalloc3DArray() with the cudaArrayCubemap flag.
立方体贴图纹理只能是通过调用 cudaMalloc3DArray() 并使用 cudaArrayCubemap 标志创建的 CUDA 数组。

Cubemap textures are fetched using the device function described in texCubemap().
立方体贴图使用在 texCubemap()中描述的设备函数获取。

Cubemap textures are only supported on devices of compute capability 2.0 and higher.
立方体贴图仅支持计算能力为 2.0 及更高的设备。
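The face selection and (s, t) mapping of Table 3 can be reproduced in ordinary host code. The following sketch is illustrative only (cubemapFace is not part of the CUDA API); it returns the face index and the final in-layer coordinates (s/m+1)/2 and (t/m+1)/2:

```cpp
#include <cassert>
#include <cmath>

// Select a cubemap face and in-layer coordinates per Table 3: the face is
// chosen by the coordinate with the largest magnitude m, then the layer is
// addressed with (s/m + 1)/2 and (t/m + 1)/2.
int cubemapFace(float x, float y, float z, float* s, float* t) {
    float ax = std::fabs(x), ay = std::fabs(y), az = std::fabs(z);
    int face;
    float m;
    if (ax > ay && ax > az) {        // x has the largest magnitude
        m = ax; face = (x >= 0) ? 0 : 1;
        *s = (x >= 0) ? -z : z;  *t = -y;
    } else if (ay > ax && ay > az) { // y has the largest magnitude
        m = ay; face = (y >= 0) ? 2 : 3;
        *s = x;  *t = (y >= 0) ? z : -z;
    } else {                         // z has the largest magnitude
        m = az; face = (z >= 0) ? 4 : 5;
        *s = (z >= 0) ? x : -x;  *t = -y;
    }
    *s = (*s / m + 1.0f) / 2.0f;
    *t = (*t / m + 1.0f) / 2.0f;
    return face;
}
```

For example, the direction (1, 0, 0) selects face 0 and lands on the center of that face, (0.5, 0.5).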

3.2.14.1.5. Cubemap Layered Textures
3.2.14.1.5. 立方体贴图层纹理 

A cubemap layered texture is a layered texture whose layers are cubemaps of same dimension.
一个立方体贴图层纹理是一个层纹理,其层是相同维度的立方体贴图。

A cubemap layered texture is addressed using an integer index and three floating-point texture coordinates; the index denotes a cubemap within the sequence and the coordinates address a texel within that cubemap.
使用整数索引和三个浮点纹理坐标来访问立方贴图分层纹理;索引表示序列中的一个立方贴图,而坐标则表示该立方贴图中的一个纹素。

A cubemap layered texture can only be a CUDA array by calling cudaMalloc3DArray() with the cudaArrayLayered and cudaArrayCubemap flags.
立方体贴图分层纹理只能是通过调用 cudaMalloc3DArray() 并使用 cudaArrayLayeredcudaArrayCubemap 标志创建的 CUDA 数组。

Cubemap layered textures are fetched using the device function described in texCubemapLayered(). Texture filtering (see Texture Fetching) is done only within a layer, not across layers.
使用 texCubemapLayered()中描述的设备函数获取立方贴图分层纹理。纹理过滤(请参阅纹理获取)仅在一个层内进行,而不是跨层。

Cubemap layered textures are only supported on devices of compute capability 2.0 and higher.
Cubemap 分层纹理仅支持计算能力为 2.0 及更高的设备。

3.2.14.1.6. Texture Gather
3.2.14.1.6. 纹理聚合 

Texture gather is a special texture fetch that is available for two-dimensional textures only. It is performed by the tex2Dgather() function, which has the same parameters as tex2D(), plus an additional comp parameter equal to 0, 1, 2, or 3 (see tex2Dgather()). It returns four 32-bit numbers that correspond to the value of the component comp of each of the four texels that would have been used for bilinear filtering during a regular texture fetch. For example, if these texels are of values (253, 20, 31, 255), (250, 25, 29, 254), (249, 16, 37, 253), (251, 22, 30, 250), and comp is 2, tex2Dgather() returns (31, 29, 37, 30).
纹理聚集是仅适用于二维纹理的特殊纹理获取。它由 tex2Dgather() 函数执行,该函数具有与 tex2D() 相同的参数,再加上一个额外的 comp 参数,等于 0、1、2 或 3(请参见 tex2Dgather())。它返回四个 32 位数字,这些数字对应于在常规纹理获取期间用于双线性过滤的四个纹素的每个组件 comp 的值。例如,如果这些纹素的值为(253、20、31、255)、(250、25、29、254)、(249、16、37、253)、(251、22、30、250),且 comp 为 2,则 tex2Dgather() 返回(31、29、37、30)。
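The component selection performed by tex2Dgather() can be emulated on the host. The following sketch (gatherComp is a hypothetical helper) reproduces the (31, 29, 37, 30) example above, given the four texels a bilinear fetch would touch:

```cpp
#include <array>
#include <cassert>

// Emulate the component selection of tex2Dgather(): given the four texels
// a bilinear fetch would touch, return component `comp` (0-3) of each.
std::array<int, 4> gatherComp(const int texels[4][4], int comp) {
    return { texels[0][comp], texels[1][comp],
             texels[2][comp], texels[3][comp] };
}
```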

Note that texture coordinates are computed with only 8 bits of fractional precision. tex2Dgather() may therefore return unexpected results for cases where tex2D() would use 1.0 for one of its weights (α or β, see Linear Filtering). For example, with an x texture coordinate of 2.49805: xB=x-0.5=1.99805, however the fractional part of xB is stored in an 8-bit fixed-point format. Since 0.99805 is closer to 256.f/256.f than it is to 255.f/256.f, xB has the value 2. A tex2Dgather() in this case would therefore return indices 2 and 3 in x, instead of indices 1 and 2.
请注意,纹理坐标仅使用 8 位小数精度进行计算。因此,对于 tex2D() 中一个权重(α或β,参见线性过滤)为 1.0 的情况, tex2Dgather() 可能会产生意外结果。例如,对于 x 纹理坐标为 2.49805:xB=x-0.5=1.99805,但是 xB 的小数部分以 8 位固定点格式存储。由于 0.99805 更接近 256.f/256.f 而不是 255.f/256.f,因此 xB 的值为 2。在这种情况下, tex2Dgather() 将返回 x 中的索引 2 和 3,而不是索引 1 和 2。
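The 8-bit fixed-point behavior can also be reproduced on the host. The following sketch (quantize8 is an illustrative helper) rounds the fractional part of xB = x - 0.5 to the nearest multiple of 1/256, showing how x = 2.49805 ends up addressing row 2 rather than row 1:

```cpp
#include <cassert>
#include <cmath>

// Texture coordinates carry only 8 fractional bits: the fractional part of
// xB = x - 0.5 is rounded to the nearest multiple of 1/256 before sampling.
float quantize8(float x) {
    float xB    = x - 0.5f;
    float ipart = std::floor(xB);
    float frac  = xB - ipart;
    return ipart + std::round(frac * 256.0f) / 256.0f;
}
```

Here quantize8(2.49805f) yields 2.0, because the fractional part 0.99805 rounds up to 256/256.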

Texture gather is only supported for CUDA arrays created with the cudaArrayTextureGather flag and of width and height less than the maximum specified in Table 21 for texture gather, which is smaller than for regular texture fetch.
纹理聚集仅支持使用 cudaArrayTextureGather 标志创建的 CUDA 数组,其宽度和高度小于表 21 中为纹理聚集指定的最大值,该最大值小于常规纹理提取的最大值。

Texture gather is only supported on devices of compute capability 2.0 and higher.
纹理聚集仅支持计算能力为 2.0 及更高的设备。

3.2.14.2. Surface Memory
3.2.14.2. 表面内存 

For devices of compute capability 2.0 and higher, a CUDA array (described in Cubemap Surfaces), created with the cudaArraySurfaceLoadStore flag, can be read and written via a surface object using the functions described in Surface Functions.
对于计算能力为 2.0 及更高的设备,使用 cudaArraySurfaceLoadStore 标志创建的 CUDA 数组(在立方体贴图表面中描述),可以通过使用表面函数中描述的函数,通过表面对象进行读取和写入。

Table 21 lists the maximum surface width, height, and depth depending on the compute capability of the device.
表 21 根据设备的计算能力列出了最大表面宽度、高度和深度。

3.2.14.2.1. Surface Object API
3.2.14.2.1. 表面对象 API 

A surface object is created using cudaCreateSurfaceObject() from a resource description of type struct cudaResourceDesc. Unlike texture memory, surface memory uses byte addressing. This means that the x-coordinate used to access a texture element via texture functions needs to be multiplied by the byte size of the element to access the same element via a surface function. For example, the element at texture coordinate x of a one-dimensional floating-point CUDA array bound to a texture object texObj and a surface object surfObj is read using tex1D(texObj, x) via texObj, but surf1Dread(surfObj, 4*x) via surfObj. Similarly, the element at texture coordinates x and y of a two-dimensional floating-point CUDA array bound to a texture object texObj and a surface object surfObj is accessed using tex2D(texObj, x, y) via texObj, but surf2Dread(surfObj, 4*x, y) via surfObj (the byte offset of the y-coordinate is internally calculated from the underlying line pitch of the CUDA array).
使用 cudaCreateSurfaceObject() 从类型为 struct cudaResourceDesc 的资源描述创建表面对象。与纹理内存不同,表面内存使用字节寻址。这意味着通过纹理函数访问纹理元素时使用的 x 坐标,需要乘以元素的字节大小,才能通过表面函数访问相同的元素。例如,对于绑定到纹理对象 texObj 和表面对象 surfObj 的一维浮点 CUDA 数组,位于纹理坐标 x 处的元素通过 texObj 使用 tex1D(texObj, x) 读取,而通过 surfObj 使用 surf1Dread(surfObj, 4*x) 读取。类似地,对于绑定到纹理对象 texObj 和表面对象 surfObj 的二维浮点 CUDA 数组,位于纹理坐标 x 和 y 处的元素通过 texObj 使用 tex2D(texObj, x, y) 访问,而通过 surfObj 使用 surf2Dread(surfObj, 4*x, y) 访问(y 坐标的字节偏移量由 CUDA 数组的底层行间距在内部计算)。
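The byte addressing rule amounts to simple offset arithmetic. The following host-side sketch shows how an element coordinate maps to a byte offset (surfByteOffset2D is a hypothetical helper; actual surface functions compute the y offset internally from the array's pitch):

```cpp
#include <cassert>
#include <cstddef>

// Surface memory is byte addressed: the x coordinate passed to a surface
// function is the element index times the element size in bytes, while the
// y offset comes from the row pitch (bytes per padded row).
std::size_t surfByteOffset2D(std::size_t x, std::size_t y,
                             std::size_t elemSize, std::size_t pitch) {
    return y * pitch + x * elemSize;
}
```

For a float element, x = 3 in row 0 lands at byte offset 12, matching the 4*x factor used with surf1Dread() and surf2Dread().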

The following code sample applies some simple transformation kernel to a surface.
以下代码示例将一些简单的转换内核应用于表面。

// Simple copy kernel
__global__ void copyKernel(cudaSurfaceObject_t inputSurfObj,
                           cudaSurfaceObject_t outputSurfObj,
                           int width, int height)
{
    // Calculate surface coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        uchar4 data;
        // Read from input surface
        surf2Dread(&data,  inputSurfObj, x * 4, y);
        // Write to output surface
        surf2Dwrite(data, outputSurfObj, x * 4, y);
    }
}

// Host code
int main()
{
    const int height = 1024;
    const int width = 1024;

    // Allocate and set some host data
    unsigned char *h_data =
        (unsigned char *)std::malloc(sizeof(unsigned char) * width * height * 4);
    for (int i = 0; i < height * width * 4; ++i)
        h_data[i] = i;

    // Allocate CUDA arrays in device memory
    cudaChannelFormatDesc channelDesc =
        cudaCreateChannelDesc(8, 8, 8, 8, cudaChannelFormatKindUnsigned);
    cudaArray_t cuInputArray;
    cudaMallocArray(&cuInputArray, &channelDesc, width, height,
                    cudaArraySurfaceLoadStore);
    cudaArray_t cuOutputArray;
    cudaMallocArray(&cuOutputArray, &channelDesc, width, height,
                    cudaArraySurfaceLoadStore);

    // Set pitch of the source (the width in memory in bytes of the 2D array
    // pointed to by src, including padding), we don't have any padding
    const size_t spitch = 4 * width * sizeof(unsigned char);
    // Copy data located at address h_data in host memory to device memory
    cudaMemcpy2DToArray(cuInputArray, 0, 0, h_data, spitch,
                        4 * width * sizeof(unsigned char), height,
                        cudaMemcpyHostToDevice);

    // Specify surface
    struct cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeArray;

    // Create the surface objects
    resDesc.res.array.array = cuInputArray;
    cudaSurfaceObject_t inputSurfObj = 0;
    cudaCreateSurfaceObject(&inputSurfObj, &resDesc);
    resDesc.res.array.array = cuOutputArray;
    cudaSurfaceObject_t outputSurfObj = 0;
    cudaCreateSurfaceObject(&outputSurfObj, &resDesc);

    // Invoke kernel
    dim3 threadsperBlock(16, 16);
    dim3 numBlocks((width + threadsperBlock.x - 1) / threadsperBlock.x,
                    (height + threadsperBlock.y - 1) / threadsperBlock.y);
    copyKernel<<<numBlocks, threadsperBlock>>>(inputSurfObj, outputSurfObj, width,
                                                height);

    // Copy data from device back to host
    cudaMemcpy2DFromArray(h_data, spitch, cuOutputArray, 0, 0,
                            4 * width * sizeof(unsigned char), height,
                            cudaMemcpyDeviceToHost);

    // Destroy surface objects
    cudaDestroySurfaceObject(inputSurfObj);
    cudaDestroySurfaceObject(outputSurfObj);

    // Free device memory
    cudaFreeArray(cuInputArray);
    cudaFreeArray(cuOutputArray);

    // Free host memory
    free(h_data);

    return 0;
}
3.2.14.2.2. Cubemap Surfaces
3.2.14.2.2. 立方体贴图表面 

Cubemap surfaces are accessed using surfCubemapread() and surfCubemapwrite() as a two-dimensional layered surface, i.e., using an integer index denoting a face and two floating-point texture coordinates addressing a texel within the layer corresponding to this face. Faces are ordered as indicated in Table 3.
立方体贴图表面通过 surfCubemapread()surfCubemapwrite() 作为二维分层表面进行访问,即使用一个表示面的整数索引和两个寻址该面对应层内纹素的浮点纹理坐标。各个面按表 3 中指示的顺序排列。

3.2.14.2.3. Cubemap Layered Surfaces
3.2.14.2.3. 立方体贴图分层表面 

Cubemap layered surfaces are accessed using surfCubemapLayeredread() and surfCubemapLayeredwrite() as a two-dimensional layered surface, i.e., using an integer index denoting a face of one of the cubemaps and two floating-point texture coordinates addressing a texel within the layer corresponding to this face. Faces are ordered as indicated in Table 3, so index ((2 * 6) + 3), for example, accesses the fourth face of the third cubemap.
立方体贴图分层表面通过 surfCubemapLayeredread()surfCubemapLayeredwrite() 作为二维分层表面进行访问,即使用一个表示某个立方体贴图某个面的整数索引和两个寻址该面对应层内纹素的浮点纹理坐标。各个面按表 3 中指示的顺序排列,因此,例如索引 ((2 * 6) + 3) 访问第三个立方体贴图的第四个面。

3.2.14.3. CUDA Arrays
3.2.14.3. CUDA 数组 

CUDA arrays are opaque memory layouts optimized for texture fetching. They are one dimensional, two dimensional, or three-dimensional and composed of elements, each of which has 1, 2 or 4 components that may be signed or unsigned 8-, 16-, or 32-bit integers, 16-bit floats, or 32-bit floats. CUDA arrays are only accessible by kernels through texture fetching as described in Texture Memory or surface reading and writing as described in Surface Memory.
CUDA 数组是为纹理获取优化的不透明内存布局。它们是一维、二维或三维的,由元素组成,每个元素具有 1、2 或 4 个分量,这些分量可以是有符号或无符号的 8 位、16 位或 32 位整数、16 位浮点数或 32 位浮点数。内核只能通过纹理内存中描述的纹理获取,或表面内存中描述的表面读写来访问 CUDA 数组。

3.2.14.4. Read/Write Coherency
3.2.14.4. 读/写一致性 

The texture and surface memory is cached (see Device Memory Accesses) and within the same kernel call, the cache is not kept coherent with respect to global memory writes and surface memory writes, so any texture fetch or surface read to an address that has been written to via a global write or a surface write in the same kernel call returns undefined data. In other words, a thread can safely read some texture or surface memory location only if this memory location has been updated by a previous kernel call or memory copy, but not if it has been previously updated by the same thread or another thread from the same kernel call.
纹理和表面内存会被缓存(请参阅设备内存访问),并且在同一内核调用内,该缓存与全局内存写入和表面内存写入不保持一致。因此,在同一内核调用中,对已通过全局写入或表面写入更新过的地址进行任何纹理获取或表面读取,都会返回未定义的数据。换句话说,只有当某个纹理或表面内存位置已由先前的内核调用或内存复制更新时,线程才能安全地读取它;如果该位置先前已由同一内核调用中的同一线程或其他线程更新,则不能安全读取。

3.2.15. Graphics Interoperability
3.2.15. 图形互操作性 

Some resources from OpenGL and Direct3D may be mapped into the address space of CUDA, either to enable CUDA to read data written by OpenGL or Direct3D, or to enable CUDA to write data for consumption by OpenGL or Direct3D.
一些来自 OpenGL 和 Direct3D 的资源可以映射到 CUDA 的地址空间中,以便让 CUDA 能够读取由 OpenGL 或 Direct3D 写入的数据,或者让 CUDA 能够写入数据以供 OpenGL 或 Direct3D 使用。

A resource must be registered to CUDA before it can be mapped using the functions mentioned in OpenGL Interoperability and Direct3D Interoperability. These functions return a pointer to a CUDA graphics resource of type struct cudaGraphicsResource. Registering a resource is potentially high-overhead and therefore typically called only once per resource. A CUDA graphics resource is unregistered using cudaGraphicsUnregisterResource(). Each CUDA context which intends to use the resource is required to register it separately.
资源必须先注册到 CUDA,然后才能使用 OpenGL 互操作性和 Direct3D 互操作性中提到的函数进行映射。这些函数返回一个指向类型为 struct cudaGraphicsResource 的 CUDA 图形资源的指针。注册资源的开销可能很高,因此通常每个资源仅调用一次。使用 cudaGraphicsUnregisterResource() 取消注册 CUDA 图形资源。每个打算使用该资源的 CUDA 上下文都需要单独注册它。

Once a resource is registered to CUDA, it can be mapped and unmapped as many times as necessary using cudaGraphicsMapResources() and cudaGraphicsUnmapResources(). cudaGraphicsResourceSetMapFlags() can be called to specify usage hints (write-only, read-only) that the CUDA driver can use to optimize resource management.
一旦资源注册到 CUDA,可以使用 cudaGraphicsMapResources()cudaGraphicsUnmapResources() 进行多次映射和取消映射。可以调用 cudaGraphicsResourceSetMapFlags() 来指定使用提示(仅写,仅读),CUDA 驱动程序可以使用这些提示来优化资源管理。

A mapped resource can be read from or written to by kernels using the device memory address returned by cudaGraphicsResourceGetMappedPointer() for buffers and cudaGraphicsSubResourceGetMappedArray() for CUDA arrays.
映射的资源可以通过使用由 cudaGraphicsResourceGetMappedPointer() 返回的设备内存地址(对于缓冲区)和 cudaGraphicsSubResourceGetMappedArray() (对于 CUDA 数组)来读取或写入内核。

Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results. OpenGL Interoperability and Direct3D Interoperability give specifics for each graphics API and some code samples. SLI Interoperability gives specifics for when the system is in SLI mode.
通过 OpenGL、Direct3D 或另一个 CUDA 上下文访问资源时,如果该资源被映射,将产生未定义的结果。OpenGL 互操作性和 Direct3D 互操作性为每个图形 API 提供具体信息和一些代码示例。SLI 互操作性为系统处于 SLI 模式时提供具体信息。

3.2.15.1. OpenGL Interoperability
3.2.15.1. OpenGL 互操作性 

The OpenGL resources that may be mapped into the address space of CUDA are OpenGL buffer, texture, and renderbuffer objects.
可以映射到 CUDA 地址空间的 OpenGL 资源包括 OpenGL 缓冲区、纹理和渲染缓冲对象。

A buffer object is registered using cudaGraphicsGLRegisterBuffer(). In CUDA, it appears as a device pointer and can therefore be read and written by kernels or via cudaMemcpy() calls.
使用 cudaGraphicsGLRegisterBuffer() 注册缓冲区对象。在 CUDA 中,它显示为设备指针,因此可以通过内核读取和写入,或通过 cudaMemcpy() 调用。

A texture or renderbuffer object is registered using cudaGraphicsGLRegisterImage(). In CUDA, it appears as a CUDA array. Kernels can read from the array by binding it to a texture or surface reference. They can also write to it via the surface write functions if the resource has been registered with the cudaGraphicsRegisterFlagsSurfaceLoadStore flag. The array can also be read and written via cudaMemcpy2D() calls. cudaGraphicsGLRegisterImage() supports all texture formats with 1, 2, or 4 components and an internal type of float (for example, GL_RGBA_FLOAT32), normalized integer (for example, GL_RGBA8, GL_INTENSITY16), and unnormalized integer (for example, GL_RGBA8UI) (please note that since unnormalized integer formats require OpenGL 3.0, they can only be written by shaders, not the fixed function pipeline).
使用 cudaGraphicsGLRegisterImage() 注册纹理或渲染缓冲对象。在 CUDA 中,它显示为 CUDA 数组。内核可以通过将其绑定到纹理或表面引用来从该数组中读取。如果资源已使用 cudaGraphicsRegisterFlagsSurfaceLoadStore 标志注册,内核还可以通过表面写函数向其写入。该数组也可以通过 cudaMemcpy2D() 调用读取和写入。 cudaGraphicsGLRegisterImage() 支持所有具有 1、2 或 4 个分量且内部类型为浮点(例如 GL_RGBA_FLOAT32 )、归一化整数(例如 GL_RGBA8GL_INTENSITY16 )或非归一化整数(例如 GL_RGBA8UI )的纹理格式(请注意,由于非归一化整数格式需要 OpenGL 3.0,它们只能由着色器写入,而不能由固定功能管线写入)。

The OpenGL context whose resources are being shared has to be current to the host thread making any OpenGL interoperability API calls.
正在共享资源的 OpenGL 上下文必须对进行任何 OpenGL 互操作 API 调用的主机线程是当前的。

Please note: When an OpenGL texture is made bindless (say for example by requesting an image or texture handle using the glGetTextureHandle*/glGetImageHandle* APIs) it cannot be registered with CUDA. The application needs to register the texture for interop before requesting an image or texture handle.
请注意:当 OpenGL 纹理被设置为无绑定(例如通过使用 glGetTextureHandle*/glGetImageHandle* API 请求图像或纹理句柄)时,它无法注册到 CUDA。应用程序需要在请求图像或纹理句柄之前为互操作注册该纹理。

The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object:
以下代码示例使用内核动态修改存储在顶点缓冲对象中的 2D width x height 网格的顶点:

GLuint positionsVBO;
struct cudaGraphicsResource* positionsVBO_CUDA;

int main()
{
    // Initialize OpenGL and GLUT for device 0
    // and make the OpenGL context current
    ...
    glutDisplayFunc(display);

    // Explicitly set device 0
    cudaSetDevice(0);

    // Create buffer object and register it with CUDA
    glGenBuffers(1, &positionsVBO);
    glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
    unsigned int size = width * height * 4 * sizeof(float);
    glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    cudaGraphicsGLRegisterBuffer(&positionsVBO_CUDA,
                                 positionsVBO,
                                 cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    glutMainLoop();

    ...
}

void display()
{
    // Map buffer object for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVBO_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVBO_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap buffer object
    cudaGraphicsUnmapResources(1, &positionsVBO_CUDA, 0);

    // Render from buffer object
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glBindBuffer(GL_ARRAY_BUFFER, positionsVBO);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glEnableClientState(GL_VERTEX_ARRAY);
    glDrawArrays(GL_POINTS, 0, width * height);
    glDisableClientState(GL_VERTEX_ARRAY);

    // Swap buffers
    glutSwapBuffers();
    glutPostRedisplay();
}
void deleteVBO()
{
    cudaGraphicsUnregisterResource(positionsVBO_CUDA);
    glDeleteBuffers(1, &positionsVBO);
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] = make_float4(u, w, v, 1.0f);
}

On Windows and for Quadro GPUs, cudaWGLGetDevice() can be used to retrieve the CUDA device associated to the handle returned by wglEnumGpusNV(). Quadro GPUs offer higher performance OpenGL interoperability than GeForce and Tesla GPUs in a multi-GPU configuration where OpenGL rendering is performed on the Quadro GPU and CUDA computations are performed on other GPUs in the system.
在 Windows 平台上,对于 Quadro GPU, cudaWGLGetDevice() 可以用来检索与 wglEnumGpusNV() 返回的句柄相关联的 CUDA 设备。在多 GPU 配置中,Quadro GPU 提供比 GeForce 和 Tesla GPU 更高性能的 OpenGL 互操作性,其中 OpenGL 渲染在 Quadro GPU 上执行,而 CUDA 计算在系统中的其他 GPU 上执行。

3.2.15.2. Direct3D Interoperability
3.2.15.2. Direct3D 互操作性 

Direct3D interoperability is supported for Direct3D 9Ex, Direct3D 10, and Direct3D 11.
Direct3D 互操作性支持 Direct3D 9Ex、Direct3D 10 和 Direct3D 11。

A CUDA context may interoperate only with Direct3D devices that fulfill the following criteria: Direct3D 9Ex devices must be created with DeviceType set to D3DDEVTYPE_HAL and BehaviorFlags with the D3DCREATE_HARDWARE_VERTEXPROCESSING flag; Direct3D 10 and Direct3D 11 devices must be created with DriverType set to D3D_DRIVER_TYPE_HARDWARE.
CUDA 上下文只能与满足以下条件的 Direct3D 设备进行互操作:Direct3D 9Ex 设备必须在创建时将 DeviceType 设置为 D3DDEVTYPE_HAL ,并且 BehaviorFlags 包含 D3DCREATE_HARDWARE_VERTEXPROCESSING 标志;Direct3D 10 和 Direct3D 11 设备必须在创建时将 DriverType 设置为 D3D_DRIVER_TYPE_HARDWARE

The Direct3D resources that may be mapped into the address space of CUDA are Direct3D buffers, textures, and surfaces. These resources are registered using cudaGraphicsD3D9RegisterResource(), cudaGraphicsD3D10RegisterResource(), and cudaGraphicsD3D11RegisterResource().
可以映射到 CUDA 地址空间的 Direct3D 资源包括 Direct3D 缓冲区、纹理和表面。这些资源是使用 cudaGraphicsD3D9RegisterResource()cudaGraphicsD3D10RegisterResource()cudaGraphicsD3D11RegisterResource() 进行注册的。

The following code sample uses a kernel to dynamically modify a 2D width x height grid of vertices stored in a vertex buffer object.
以下代码示例使用内核动态修改存储在顶点缓冲对象中的 2D width x height 网格的顶点。

3.2.15.2.1. Direct3D 9 Version
3.2.15.2.1. Direct3D 9 版本 
IDirect3D9Ex* D3D;
IDirect3DDevice9Ex* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
IDirect3DVertexBuffer9* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Initialize Direct3D
    Direct3DCreate9Ex(D3D_SDK_VERSION, &D3D);

    // Get a CUDA-enabled adapter
    unsigned int adapter = 0;
    for (; adapter < D3D->GetAdapterCount(); adapter++) {
        D3DADAPTER_IDENTIFIER9 adapterId;
        D3D->GetAdapterIdentifier(adapter, 0, &adapterId);
        if (cudaD3D9GetDevice(&dev, adapterId.DeviceName)
            == cudaSuccess)
            break;
    }

     // Create device
    ...
    D3D->CreateDeviceEx(adapter, D3DDEVTYPE_HAL, hWnd,
                        D3DCREATE_HARDWARE_VERTEXPROCESSING,
                        &params, NULL, &device);

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    device->CreateVertexBuffer(size, 0, D3DFVF_CUSTOMVERTEX,
                               D3DPOOL_DEFAULT, &positionsVB, 0);
    cudaGraphicsD3D9RegisterResource(&positionsVB_CUDA,
                                     positionsVB,
                                     cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                    cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}
void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA);

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}
3.2.15.2.2. Direct3D 10 Version
3.2.15.2.2. Direct3D 10 版本 
ID3D10Device* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
ID3D10Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Get a CUDA-enabled adapter
    IDXGIFactory* factory;
    CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
    IDXGIAdapter* adapter = 0;
    for (unsigned int i = 0; !adapter; ++i) {
        if (FAILED(factory->EnumAdapters(i, &adapter)))
            break;
        if (cudaD3D10GetDevice(&dev, adapter) == cudaSuccess)
            break;
        adapter->Release();
    }
    factory->Release();

    // Create swap chain and device
    ...
    D3D10CreateDeviceAndSwapChain(adapter,
                                  D3D10_DRIVER_TYPE_HARDWARE, 0,
                                  D3D10_CREATE_DEVICE_DEBUG,
                                  D3D10_SDK_VERSION,
                                  &swapChainDesc, &swapChain,
                                  &device);
    adapter->Release();

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    D3D10_BUFFER_DESC bufferDesc;
    bufferDesc.Usage          = D3D10_USAGE_DEFAULT;
    bufferDesc.ByteWidth      = size;
    bufferDesc.BindFlags      = D3D10_BIND_VERTEX_BUFFER;
    bufferDesc.CPUAccessFlags = 0;
    bufferDesc.MiscFlags      = 0;
    device->CreateBuffer(&bufferDesc, 0, &positionsVB);
    cudaGraphicsD3D10RegisterResource(&positionsVB_CUDA,
                                      positionsVB,
                                      cudaGraphicsRegisterFlagsNone);
                                      cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                      cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}
void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

__global__ void createVertices(float4* positions, float time,
                               unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}
3.2.15.2.3. Direct3D 11 Version
3.2.15.2.3. Direct3D 11 版本 
ID3D11Device* device;
struct CUSTOMVERTEX {
    FLOAT x, y, z;
    DWORD color;
};
ID3D11Buffer* positionsVB;
struct cudaGraphicsResource* positionsVB_CUDA;

int main()
{
    int dev;
    // Get a CUDA-enabled adapter
    IDXGIFactory* factory;
    CreateDXGIFactory(__uuidof(IDXGIFactory), (void**)&factory);
    IDXGIAdapter* adapter = 0;
    for (unsigned int i = 0; !adapter; ++i) {
        if (FAILED(factory->EnumAdapters(i, &adapter))
            break;
        if (cudaD3D11GetDevice(&dev, adapter) == cudaSuccess)
            break;
        adapter->Release();
    }
    factory->Release();

    // Create swap chain and device
    ...
    sFnPtr_D3D11CreateDeviceAndSwapChain(adapter,
                                         D3D11_DRIVER_TYPE_HARDWARE,
                                         0,
                                         D3D11_CREATE_DEVICE_DEBUG,
                                         featureLevels, 3,
                                         D3D11_SDK_VERSION,
                                         &swapChainDesc, &swapChain,
                                         &device,
                                         &featureLevel,
                                         &deviceContext);
    adapter->Release();

    // Use the same device
    cudaSetDevice(dev);

    // Create vertex buffer and register it with CUDA
    unsigned int size = width * height * sizeof(CUSTOMVERTEX);
    D3D11_BUFFER_DESC bufferDesc;
    bufferDesc.Usage          = D3D11_USAGE_DEFAULT;
    bufferDesc.ByteWidth      = size;
    bufferDesc.BindFlags      = D3D11_BIND_VERTEX_BUFFER;
    bufferDesc.CPUAccessFlags = 0;
    bufferDesc.MiscFlags      = 0;
    device->CreateBuffer(&bufferDesc, 0, &positionsVB);
    cudaGraphicsD3D11RegisterResource(&positionsVB_CUDA,
                                      positionsVB,
                                      cudaGraphicsRegisterFlagsNone);
    cudaGraphicsResourceSetMapFlags(positionsVB_CUDA,
                                    cudaGraphicsMapFlagsWriteDiscard);

    // Launch rendering loop
    while (...) {
        ...
        Render();
        ...
    }
    ...
}
void Render()
{
    // Map vertex buffer for writing from CUDA
    float4* positions;
    cudaGraphicsMapResources(1, &positionsVB_CUDA, 0);
    size_t num_bytes;
    cudaGraphicsResourceGetMappedPointer((void**)&positions,
                                         &num_bytes,
                                         positionsVB_CUDA));

    // Execute kernel
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
    createVertices<<<dimGrid, dimBlock>>>(positions, time,
                                          width, height);

    // Unmap vertex buffer
    cudaGraphicsUnmapResources(1, &positionsVB_CUDA, 0);

    // Draw and present
    ...
}

void releaseVB()
{
    cudaGraphicsUnregisterResource(positionsVB_CUDA);
    positionsVB->Release();
}

    __global__ void createVertices(float4* positions, float time,
                          unsigned int width, unsigned int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;

// Calculate uv coordinates
    float u = x / (float)width;
    float v = y / (float)height;
    u = u * 2.0f - 1.0f;
    v = v * 2.0f - 1.0f;

    // Calculate simple sine wave pattern
    float freq = 4.0f;
    float w = sinf(u * freq + time)
            * cosf(v * freq + time) * 0.5f;

    // Write positions
    positions[y * width + x] =
                make_float4(u, w, v, __int_as_float(0xff00ff00));
}

3.2.15.3. SLI Interoperability
3.2.15.3. SLI 互操作性 

In a system with multiple GPUs, all CUDA-enabled GPUs are accessible via the CUDA driver and runtime as separate devices. There are however special considerations as described below when the system is in SLI mode.
在具有多个 GPU 的系统中,所有支持 CUDA 的 GPU 都可以通过 CUDA 驱动程序和运行时作为独立设备访问。但是,当系统处于 SLI 模式时,有一些特殊注意事项如下所述。

First, an allocation in one CUDA device on one GPU will consume memory on other GPUs that are part of the SLI configuration of the Direct3D or OpenGL device. Because of this, allocations may fail earlier than otherwise expected.
首先,在一个 GPU 上的一个 CUDA 设备上的分配将占用 Direct3D 或 OpenGL 设备的 SLI 配置的其他 GPU 上的内存。因此,分配可能比预期的更早失败。

Second, applications should create multiple CUDA contexts, one for each GPU in the SLI configuration. While this is not a strict requirement, it avoids unnecessary data transfers between devices. The application can use the cudaD3D[9|10|11]GetDevices() for Direct3D and cudaGLGetDevices() for OpenGL set of calls to identify the CUDA device handle(s) for the device(s) that are performing the rendering in the current and next frame. Given this information the application will typically choose the appropriate device and map Direct3D or OpenGL resources to the CUDA device returned by cudaD3D[9|10|11]GetDevices() or cudaGLGetDevices() when the deviceList parameter is set to cudaD3D[9|10|11]DeviceListCurrentFrame or cudaGLDeviceListCurrentFrame.
其次,应用程序应为 SLI 配置中的每个 GPU 创建多个 CUDA 上下文。虽然这不是严格要求,但可以避免设备之间不必要的数据传输。应用程序可以使用 cudaD3D[9|10|11]GetDevices() 用于 Direct3D 和 cudaGLGetDevices() 用于 OpenGL 的一组调用来识别当前帧和下一帧中执行渲染的设备的 CUDA 设备句柄。根据这些信息,应用程序通常会选择适当的设备,并在 deviceList 参数设置为 cudaD3D[9|10|11]DeviceListCurrentFramecudaGLDeviceListCurrentFrame 时将 Direct3D 或 OpenGL 资源映射到由 cudaD3D[9|10|11]GetDevices()cudaGLGetDevices() 返回的 CUDA 设备。

Please note that resource returned from cudaGraphicsD9D[9|10|11]RegisterResource and cudaGraphicsGLRegister[Buffer|Image] must be only used on device the registration happened. Therefore on SLI configurations when data for different frames is computed on different CUDA devices it is necessary to register the resources for each separately.
请注意,从 cudaGraphicsD9D[9|10|11]RegisterResourcecudaGraphicsGLRegister[Buffer|Image] 返回的资源只能在发生注册的设备上使用。因此,在 SLI 配置中,当在不同的 CUDA 设备上计算不同帧的数据时,必须分别为每个资源注册。

See Direct3D Interoperability and OpenGL Interoperability for details on how the CUDA runtime interoperate with Direct3D and OpenGL, respectively.
有关 CUDA 运行时如何与 Direct3D 和 OpenGL 进行互操作的详细信息,请参阅 Direct3D 互操作性和 OpenGL 互操作性。

3.2.16. External Resource Interoperability
3.2.16. 外部资源互操作性 

External resource interoperability allows CUDA to import certain resources that are explicitly exported by other APIs. These objects are typically exported by other APIs using handles native to the Operating System, like file descriptors on Linux or NT handles on Windows. They could also be exported using other unified interfaces such as the NVIDIA Software Communication Interface. There are two types of resources that can be imported: memory objects and synchronization objects.
外部资源互操作性允许 CUDA 导入由其他 API 显式导出的某些资源。这些对象通常由其他 API 使用操作系统本机句柄(例如 Linux 上的文件描述符或 Windows 上的 NT 句柄)导出。它们也可以使用其他统一接口(如 NVIDIA 软件通信接口)导出。可以导入两种类型的资源:内存对象和同步对象。

Memory objects can be imported into CUDA using cudaImportExternalMemory(). An imported memory object can be accessed from within kernels using device pointers mapped onto the memory object via cudaExternalMemoryGetMappedBuffer()or CUDA mipmapped arrays mapped via cudaExternalMemoryGetMappedMipmappedArray(). Depending on the type of memory object, it may be possible for more than one mapping to be setup on a single memory object. The mappings must match the mappings setup in the exporting API. Any mismatched mappings result in undefined behavior. Imported memory objects must be freed using cudaDestroyExternalMemory(). Freeing a memory object does not free any mappings to that object. Therefore, any device pointers mapped onto that object must be explicitly freed using cudaFree() and any CUDA mipmapped arrays mapped onto that object must be explicitly freed using cudaFreeMipmappedArray(). It is illegal to access mappings to an object after it has been destroyed.
内存对象可以使用 cudaImportExternalMemory() 导入到 CUDA 中。导入的内存对象可以通过 cudaExternalMemoryGetMappedBuffer() 或通过 cudaExternalMemoryGetMappedMipmappedArray() 映射到内存对象上的设备指针从内核中访问。根据内存对象的类型,可能会在单个内存对象上设置多个映射。映射必须与导出 API 中设置的映射相匹配。任何不匹配的映射都会导致未定义的行为。导入的内存对象必须使用 cudaDestroyExternalMemory() 释放。释放内存对象不会释放对该对象的任何映射。因此,必须显式释放映射到该对象的任何设备指针,使用 cudaFree() 和映射到该对象的任何 CUDA mipmapped 数组必须显式释放,使用 cudaFreeMipmappedArray() 。在销毁对象后访问对象的映射是非法的。

Synchronization objects can be imported into CUDA using cudaImportExternalSemaphore(). An imported synchronization object can then be signaled using cudaSignalExternalSemaphoresAsync() and waited on using cudaWaitExternalSemaphoresAsync(). It is illegal to issue a wait before the corresponding signal has been issued. Also, depending on the type of the imported synchronization object, there may be additional constraints imposed on how they can be signaled and waited on, as described in subsequent sections. Imported semaphore objects must be freed using cudaDestroyExternalSemaphore(). All outstanding signals and waits must have completed before the semaphore object is destroyed.
同步对象可以使用 cudaImportExternalSemaphore() 导入到 CUDA 中。然后可以使用 cudaSignalExternalSemaphoresAsync() 发出导入的同步对象信号,并使用 cudaWaitExternalSemaphoresAsync() 等待它。在发出相应信号之前发出等待是非法的。此外,根据导入的同步对象类型,可能会对如何发出信号和等待进行附加约束,如后续部分所述。导入的信号量对象必须使用 cudaDestroyExternalSemaphore() 释放。在销毁信号量对象之前,必须完成所有未完成的信号和等待。

3.2.16.1. Vulkan Interoperability
3.2.16.1. Vulkan 互操作性 

3.2.16.1.1. Matching device UUIDs
3.2.16.1.1. 匹配设备 UUIDs 

When importing memory and synchronization objects exported by Vulkan, they must be imported and mapped on the same device as they were created on. The CUDA device that corresponds to the Vulkan physical device on which the objects were created can be determined by comparing the UUID of a CUDA device with that of the Vulkan physical device, as shown in the following code sample. Note that the Vulkan physical device should not be part of a device group that contains more than one Vulkan physical device. The device group as returned by vkEnumeratePhysicalDeviceGroups that contains the given Vulkan physical device must have a physical device count of 1.
当导入由 Vulkan 导出的内存和同步对象时,它们必须在创建它们的同一设备上导入和映射。可以通过比较 CUDA 设备的 UUID 与创建对象的 Vulkan 物理设备的 UUID 来确定与之对应的 CUDA 设备,如下面的代码示例所示。请注意,Vulkan 物理设备不应该是包含多个 Vulkan 物理设备的设备组的一部分。由 vkEnumeratePhysicalDeviceGroups 返回的包含给定 Vulkan 物理设备的设备组必须具有物理设备计数为 1。

int getCudaDeviceForVulkanPhysicalDevice(VkPhysicalDevice vkPhysicalDevice) {
    VkPhysicalDeviceIDProperties vkPhysicalDeviceIDProperties = {};
    vkPhysicalDeviceIDProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_ID_PROPERTIES;
    vkPhysicalDeviceIDProperties.pNext = NULL;

    VkPhysicalDeviceProperties2 vkPhysicalDeviceProperties2 = {};
    vkPhysicalDeviceProperties2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    vkPhysicalDeviceProperties2.pNext = &vkPhysicalDeviceIDProperties;

    vkGetPhysicalDeviceProperties2(vkPhysicalDevice, &vkPhysicalDeviceProperties2);

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        if (!memcmp(&deviceProp.uuid, vkPhysicalDeviceIDProperties.deviceUUID, VK_UUID_SIZE)) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
3.2.16.1.2. Importing Memory Objects
3.2.16.1.2. 导入内存对象 

On Linux and Windows 10, both dedicated and non-dedicated memory objects exported by Vulkan can be imported into CUDA. On Windows 7, only dedicated memory objects can be imported. When importing a Vulkan dedicated memory object, the flag cudaExternalMemoryDedicated must be set.
在 Linux 和 Windows 10 上,Vulkan 导出的专用和非专用内存对象都可以导入到 CUDA 中。在 Windows 7 上,只能导入专用内存对象。导入 Vulkan 专用内存对象时,必须设置标志 cudaExternalMemoryDedicated

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT can be imported into CUDA using the file descriptor associated with that object as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.
使用 VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT 导出的 Vulkan 内存对象可以通过使用与该对象关联的文件描述符导入到 CUDA 中,如下所示。请注意,CUDA 假定一旦导入文件描述符,即拥有该文件描述符。在成功导入后继续使用文件描述符会导致未定义的行为。

cudaExternalMemory_t importVulkanMemoryObjectFromFileDescriptor(int fd, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueFd;
    desc.handle.fd = fd;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it

    return extMem;
}

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object as shown below. Note that CUDA does not assume ownership of the NT handle and it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.
使用 VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT 导出的 Vulkan 内存对象可以使用与该对象关联的 NT 句柄导入到 CUDA,如下所示。请注意,CUDA 不会假定对 NT 句柄的所有权,并且当不再需要时,应用程序有责任关闭句柄。NT 句柄持有资源的引用,因此在底层内存可以释放之前,必须显式释放它。

cudaExternalMemory_t importVulkanMemoryObjectFromNTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle if one exists as shown below.
使用 VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_BIT 导出的 Vulkan 内存对象,如果存在命名句柄,也可以使用命名句柄导入,如下所示。

cudaExternalMemory_t importVulkanMemoryObjectFromNamedNTHandle(LPCWSTR name, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

A Vulkan memory object exported using VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed.
使用 VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT 导出的 Vulkan 内存对象可以通过与该对象关联的全局共享 D3DKMT 句柄导入到 CUDA 中,如下所示。由于全局共享的 D3DKMT 句柄不持有对底层内存的引用,当对资源的所有其他引用都被销毁时,它会自动销毁。

cudaExternalMemory_t importVulkanMemoryObjectFromKMTHandle(HANDLE handle, unsigned long long size, bool isDedicated) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    if (isDedicated) {
        desc.flags |= cudaExternalMemoryDedicated;
    }

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}
3.2.16.1.3. Mapping Buffers onto Imported Memory Objects
3.2.16.1.3. 将缓冲区映射到导入的内存对象 

A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match that specified when creating the mapping using the corresponding Vulkan API. All mapped device pointers must be freed using cudaFree().
设备指针可以映射到导入的内存对象上,如下所示。映射的偏移量和大小必须与使用相应的 Vulkan API 创建映射时指定的内容相匹配。所有映射的设备指针必须使用 cudaFree() 释放。

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {

    void *ptr = NULL;

    cudaExternalMemoryBufferDesc desc = {};



    memset(&desc, 0, sizeof(desc));



    desc.offset = offset;

    desc.size = size;



    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);



    // Note: ‘ptr’ must eventually be freed using cudaFree()

    return ptr;

}
3.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
3.2.16.1.4. 将多级映射数组映射到导入的内存对象 

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Vulkan API. Additionally, if the mipmapped array is bound as a color target in Vulkan, the flagcudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Vulkan parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.
CUDA mipmapped 数组可以映射到导入的内存对象上,如下所示。偏移量、维度、格式和 mip 级别的数量必须与使用相应的 Vulkan API 创建映射时指定的内容相匹配。此外,如果将 mipmapped 数组绑定为 Vulkan 中的颜色目标,则必须设置标志 cudaArrayColorAttachment 。所有映射的 mipmapped 数组必须使用 cudaFreeMipmappedArray() 释放。以下代码示例显示了在将 mipmapped 数组映射到导入的内存对象时如何将 Vulkan 参数转换为相应的 CUDA 参数。

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForVulkanFormat(VkFormat format)
{
    cudaChannelFormatDesc d;

    memset(&d, 0, sizeof(d));

    switch (format) {
    case VK_FORMAT_R8_UINT:             d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8_SINT:             d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R8G8_UINT:           d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8G8_SINT:           d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R8G8B8A8_UINT:       d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R8G8B8A8_SINT:       d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16_UINT:            d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16_SINT:            d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16G16_UINT:         d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16G16_SINT:         d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R16G16B16A16_UINT:   d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R16G16B16A16_SINT:   d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32_UINT:            d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32_SINT:            d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32_SFLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case VK_FORMAT_R32G32_UINT:         d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32G32_SINT:         d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32G32_SFLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case VK_FORMAT_R32G32B32A32_UINT:   d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case VK_FORMAT_R32G32B32A32_SINT:   d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case VK_FORMAT_R32G32B32A32_SFLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }
    return d;
}

cudaExtent getCudaExtentForVulkanExtent(VkExtent3D vkExt, uint32_t arrayLayers, VkImageViewType vkImageViewType) {
    cudaExtent e = { 0, 0, 0 };

    switch (vkImageViewType) {
    case VK_IMAGE_VIEW_TYPE_1D:         e.width = vkExt.width; e.height = 0;            e.depth = 0;           break;
    case VK_IMAGE_VIEW_TYPE_2D:         e.width = vkExt.width; e.height = vkExt.height; e.depth = 0;           break;
    case VK_IMAGE_VIEW_TYPE_3D:         e.width = vkExt.width; e.height = vkExt.height; e.depth = vkExt.depth; break;
    case VK_IMAGE_VIEW_TYPE_CUBE:       e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_1D_ARRAY:   e.width = vkExt.width; e.height = 0;            e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_2D_ARRAY:   e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: e.width = vkExt.width; e.height = vkExt.height; e.depth = arrayLayers; break;
    default: assert(0);
    }

    return e;
}

unsigned int getCudaMipmappedArrayFlagsForVulkanImage(VkImageViewType vkImageViewType, VkImageUsageFlags vkImageUsageFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (vkImageViewType) {
    case VK_IMAGE_VIEW_TYPE_CUBE:       flags |= cudaArrayCubemap;                    break;
    case VK_IMAGE_VIEW_TYPE_CUBE_ARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case VK_IMAGE_VIEW_TYPE_1D_ARRAY:   flags |= cudaArrayLayered;                    break;
    case VK_IMAGE_VIEW_TYPE_2D_ARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (vkImageUsageFlags & VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT) {
        flags |= cudaArrayColorAttachment;
    }

    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }
    return flags;
}
3.2.16.1.5. Importing Synchronization Objects
3.2.16.1.5. 导入同步对象 

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BITcan be imported into CUDA using the file descriptor associated with that object as shown below. Note that CUDA assumes ownership of the file descriptor once it is imported. Using the file descriptor after a successful import results in undefined behavior.
使用 VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_FD_BIT 导出的 Vulkan 信号量对象可以通过使用与该对象关联的文件描述符导入到 CUDA 中,如下所示。请注意,CUDA 假定一旦导入文件描述符,即拥有该文件描述符。在成功导入后继续使用文件描述符会导致未定义的行为。

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromFileDescriptor(int fd) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueFd;
    desc.handle.fd = fd;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'fd' should not be used beyond this point as CUDA has assumed ownership of it

    return extSem;
}

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can be imported into CUDA using the NT handle associated with that object as shown below. Note that CUDA does not assume ownership of the NT handle and it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.
使用 VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT 导出的 Vulkan 信号量对象可以使用与该对象关联的 NT 句柄导入到 CUDA 中,如下所示。请注意,CUDA 不会假定对 NT 句柄的所有权,并且应用程序有责任在不再需要时关闭句柄。NT 句柄持有对资源的引用,因此在底层信号量可以释放之前必须显式释放该句柄。

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT can also be imported using a named handle if one exists as shown below.
使用 VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_BIT 导出的 Vulkan 信号量对象,如果存在命名句柄,也可以使用命名句柄导入,如下所示。

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

A Vulkan semaphore object exported using VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying semaphore it is automatically destroyed when all other references to the resource are destroyed.
使用 VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_OPAQUE_WIN32_KMT_BIT 导出的 Vulkan 信号量对象可以使用与该对象关联的全局共享 D3DKMT 句柄导入到 CUDA 中,如下所示。由于全局共享的 D3DKMT 句柄不持有对底层信号量的引用,当对资源的所有其他引用都被销毁时,它会被自动销毁。

cudaExternalSemaphore_t importVulkanSemaphoreObjectFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeOpaqueWin32Kmt;
    desc.handle.win32.handle = (void *)handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}
3.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
3.2.16.1.6. 在导入的同步对象上发出/等待信号 

An imported Vulkan semaphore object can be signaled as shown below. Signaling such a semaphore object sets it to the signaled state. The corresponding wait that waits on this signal must be issued in Vulkan. Additionally, the wait that waits on this signal must be issued after this signal has been issued.
导入的 Vulkan 信号量对象可以如下所示被标记为已发信号。标记这样的信号量对象会将其设置为已发信号状态。等待此信号的对应等待必须在 Vulkan 中发出。此信号发出后,等待此信号的等待必须在此信号之后发出。

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

An imported Vulkan semaphore object can be waited on as shown below. Waiting on such a semaphore object waits until it reaches the signaled state and then resets it back to the unsignaled state. The corresponding signal that this wait is waiting on must be issued in Vulkan. Additionally, the signal must be issued before this wait can be issued.
导入的 Vulkan 信号量对象可以如下所示等待。等待这样的信号量对象会一直等到它达到信号状态,然后将其重置为未信号状态。此等待等待的相应信号必须在 Vulkan 中发出。此外,必须在发出此等待之前发出信号。

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

3.2.16.2. OpenGL Interoperability
3.2.16.2. OpenGL 互操作性 

Traditional OpenGL-CUDA interop as outlined in OpenGL Interoperability works by CUDA directly consuming handles created in OpenGL. However, since OpenGL can also consume memory and synchronization objects created in Vulkan, there exists an alternative approach to doing OpenGL-CUDA interop. Essentially, memory and synchronization objects exported by Vulkan could be imported into both, OpenGL and CUDA, and then used to coordinate memory accesses between OpenGL and CUDA. Please refer to the following OpenGL extensions for further details on how to import memory and synchronization objects exported by Vulkan:
传统的 OpenGL-CUDA 互操作,如 OpenGL 互操作中所述,是通过 CUDA 直接使用 OpenGL 中创建的句柄来实现的。然而,由于 OpenGL 也可以使用 Vulkan 中创建的内存和同步对象,因此存在另一种实现 OpenGL-CUDA 互操作的方法。基本上,Vulkan 导出的内存和同步对象可以被导入到 OpenGL 和 CUDA 中,并用于协调 OpenGL 和 CUDA 之间的内存访问。有关如何导入 Vulkan 导出的内存和同步对象的详细信息,请参考以下 OpenGL 扩展:

  • GL_EXT_memory_object

  • GL_EXT_memory_object_fd

  • GL_EXT_memory_object_win32

  • GL_EXT_semaphore

  • GL_EXT_semaphore_fd

  • GL_EXT_semaphore_win32

3.2.16.3. Direct3D 12 Interoperability
3.2.16.3. Direct3D 12 互操作性 

3.2.16.3.1. Matching Device LUIDs
3.2.16.3.1. 匹配设备 LUIDs 

When importing memory and synchronization objects exported by Direct3D 12, they must be imported and mapped on the same device as they were created on. The CUDA device that corresponds to the Direct3D 12 device on which the objects were created can be determined by comparing the LUID of a CUDA device with that of the Direct3D 12 device, as shown in the following code sample. Note that the Direct3D 12 device must not be created on a linked node adapter; that is, the node count as returned by ID3D12Device::GetNodeCount must be 1.

int getCudaDeviceForD3D12Device(ID3D12Device *d3d12Device) {
    LUID d3d12Luid = d3d12Device->GetAdapterLuid();

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        char *cudaLuid = deviceProp.luid;

        if (!memcmp(&d3d12Luid.LowPart, cudaLuid, sizeof(d3d12Luid.LowPart)) &&
            !memcmp(&d3d12Luid.HighPart, cudaLuid + sizeof(d3d12Luid.LowPart), sizeof(d3d12Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
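Once the matching CUDA device has been found, it would typically be made current before importing any objects. The following is a minimal sketch (not from the original guide; error handling omitted, and `selectCudaDeviceForD3D12` is a hypothetical wrapper name) of how the helper above might be used:

```cpp
// Sketch: route subsequent CUDA work to the device matching the D3D12 device.
void selectCudaDeviceForD3D12(ID3D12Device *d3d12Device) {
    int cudaDevice = getCudaDeviceForD3D12Device(d3d12Device);
    if (cudaDevice != cudaInvalidDeviceId) {
        // Make the matching device current for this host thread so that
        // subsequent imports and mappings happen on the correct device.
        cudaSetDevice(cudaDevice);
    }
}
```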
3.2.16.3.2. Importing Memory Objects

A shareable Direct3D 12 heap memory object, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateHeap, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.

cudaExternalMemory_t importD3D12HeapFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 12 heap memory object can also be imported using a named handle if one exists as shown below.

cudaExternalMemory_t importD3D12HeapFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Heap;
    desc.handle.win32.name = (void *)name;
    desc.size = size;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

A shareable Direct3D 12 committed resource, created by setting the flag D3D12_HEAP_FLAG_SHARED in the call to ID3D12Device::CreateCommittedResource, can be imported into CUDA using the NT handle associated with that object as shown below. When importing a Direct3D 12 committed resource, the flag cudaExternalMemoryDedicated must be set. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed.

cudaExternalMemory_t importD3D12CommittedResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 12 committed resource can also be imported using a named handle if one exists as shown below.

cudaExternalMemory_t importD3D12CommittedResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D12Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}
3.2.16.3.3. Mapping Buffers onto Imported Memory Objects

A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match that specified when creating the mapping using the corresponding Direct3D 12 API. All mapped device pointers must be freed using cudaFree().

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
3.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Direct3D 12 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 12, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 12 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
    cudaChannelFormatDesc d;

    memset(&d, 0, sizeof(d));

    switch (dxgiFormat) {
    case DXGI_FORMAT_R8_UINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8_SINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8_UINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8_SINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8B8A8_UINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8B8A8_SINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16_UINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16_SINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16_UINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16_SINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16B16A16_UINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16B16A16_SINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_UINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32_SINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_FLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32_UINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32_SINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32_FLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32B32A32_UINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32B32A32_SINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }

    return d;
}

cudaExtent getCudaExtentForD3D12Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D12_SRV_DIMENSION d3d12SRVDimension) {
    cudaExtent e = { 0, 0, 0 };

    switch (d3d12SRVDimension) {
    case D3D12_SRV_DIMENSION_TEXTURE1D:        e.width = width; e.height = 0;      e.depth = 0;                break;
    case D3D12_SRV_DIMENSION_TEXTURE2D:        e.width = width; e.height = height; e.depth = 0;                break;
    case D3D12_SRV_DIMENSION_TEXTURE3D:        e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURECUBE:      e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURE1DARRAY:   e.width = width; e.height = 0;      e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURE2DARRAY:   e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    default: assert(0);
    }

    return e;
}

unsigned int getCudaMipmappedArrayFlagsForD3D12Resource(D3D12_SRV_DIMENSION d3d12SRVDimension, D3D12_RESOURCE_FLAGS d3d12ResourceFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (d3d12SRVDimension) {
    case D3D12_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap;                    break;
    case D3D12_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D12_SRV_DIMENSION_TEXTURE1DARRAY:   flags |= cudaArrayLayered;                    break;
    case D3D12_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (d3d12ResourceFlags & D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET) {
        flags |= cudaArrayColorAttachment;
    }
    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }

    return flags;
}
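The three translation helpers above can be composed when mapping a texture. The following is a hypothetical wrapper (not part of the original guide; the function name and the choice to request surface load/store access are assumptions for illustration) that derives the CUDA mapping parameters from a Direct3D 12 texture description:

```cpp
// Sketch: map a D3D12 texture onto an imported memory object by translating
// its DXGI format, extent, and flags into the CUDA equivalents.
cudaMipmappedArray_t mapD3D12TextureOntoExternalMemory(
        cudaExternalMemory_t extMem, unsigned long long offset,
        DXGI_FORMAT dxgiFormat,
        UINT64 width, UINT height, UINT16 depthOrArraySize, UINT16 mipLevels,
        D3D12_SRV_DIMENSION srvDimension, D3D12_RESOURCE_FLAGS resourceFlags) {
    cudaChannelFormatDesc formatDesc = getCudaChannelFormatDescForDxgiFormat(dxgiFormat);
    cudaExtent extent = getCudaExtentForD3D12Extent(width, height, depthOrArraySize, srvDimension);
    // Assumption: surface load/store access is desired in this example.
    unsigned int flags = getCudaMipmappedArrayFlagsForD3D12Resource(srvDimension, resourceFlags, true);

    // Note: the returned array must eventually be freed using cudaFreeMipmappedArray()
    return mapMipmappedArrayOntoExternalMemory(extMem, offset, &formatDesc, &extent, flags, mipLevels);
}
```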
3.2.16.3.5. Importing Synchronization Objects

A shareable Direct3D 12 fence object, created by setting the flag D3D12_FENCE_FLAG_SHARED in the call to ID3D12Device::CreateFence, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.

cudaExternalSemaphore_t importD3D12FenceFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 12 fence object can also be imported using a named handle if one exists as shown below.

cudaExternalSemaphore_t importD3D12FenceFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D12Fence;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}
3.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects

An imported Direct3D 12 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 12. Additionally, the wait that waits on this signal must be issued after this signal has been issued.

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

An imported Direct3D 12 fence object can be waited on as shown below. Waiting on such a fence object waits until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 12. Additionally, the signal must be issued before this wait can be issued.

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

3.2.16.4. Direct3D 11 Interoperability

3.2.16.4.1. Matching Device LUIDs

When importing memory and synchronization objects exported by Direct3D 11, they must be imported and mapped on the same device as they were created on. The CUDA device that corresponds to the Direct3D 11 device on which the objects were created can be determined by comparing the LUID of a CUDA device with that of the Direct3D 11 device, as shown in the following code sample.

int getCudaDeviceForD3D11Device(ID3D11Device *d3d11Device) {
    IDXGIDevice *dxgiDevice;
    d3d11Device->QueryInterface(__uuidof(IDXGIDevice), (void **)&dxgiDevice);

    IDXGIAdapter *dxgiAdapter;
    dxgiDevice->GetAdapter(&dxgiAdapter);

    DXGI_ADAPTER_DESC dxgiAdapterDesc;
    dxgiAdapter->GetDesc(&dxgiAdapterDesc);

    LUID d3d11Luid = dxgiAdapterDesc.AdapterLuid;

    int cudaDeviceCount;
    cudaGetDeviceCount(&cudaDeviceCount);

    for (int cudaDevice = 0; cudaDevice < cudaDeviceCount; cudaDevice++) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, cudaDevice);
        char *cudaLuid = deviceProp.luid;

        if (!memcmp(&d3d11Luid.LowPart, cudaLuid, sizeof(d3d11Luid.LowPart)) &&
            !memcmp(&d3d11Luid.HighPart, cudaLuid + sizeof(d3d11Luid.LowPart), sizeof(d3d11Luid.HighPart))) {
            return cudaDevice;
        }
    }
    return cudaInvalidDeviceId;
}
3.2.16.4.2. Importing Memory Objects

A shareable Direct3D 11 texture resource, namely ID3D11Texture1D, ID3D11Texture2D or ID3D11Texture3D, can be created by setting either the D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX flag (on Windows 7) or the D3D11_RESOURCE_MISC_SHARED_NTHANDLE flag (on Windows 10) when calling ID3D11Device::CreateTexture1D, ID3D11Device::CreateTexture2D or ID3D11Device::CreateTexture3D respectively. A shareable Direct3D 11 buffer resource, ID3D11Buffer, can be created by specifying either of the above flags when calling ID3D11Device::CreateBuffer. A shareable resource created by specifying the D3D11_RESOURCE_MISC_SHARED_NTHANDLE flag can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the NT handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying memory can be freed. When importing a Direct3D 11 resource, the flag cudaExternalMemoryDedicated must be set.

cudaExternalMemory_t importD3D11ResourceFromNTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extMem;
}

A shareable Direct3D 11 resource can also be imported using a named handle if one exists as shown below.

cudaExternalMemory_t importD3D11ResourceFromNamedNTHandle(LPCWSTR name, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11Resource;
    desc.handle.win32.name = (void *)name;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}

A shareable Direct3D 11 resource, created by specifying the D3D11_RESOURCE_MISC_SHARED or D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX flag, can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory, it is automatically destroyed when all other references to the resource are destroyed.

cudaExternalMemory_t importD3D11ResourceFromKMTHandle(HANDLE handle, unsigned long long size) {
    cudaExternalMemory_t extMem = NULL;
    cudaExternalMemoryHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalMemoryHandleTypeD3D11ResourceKmt;
    desc.handle.win32.handle = (void *)handle;
    desc.size = size;
    desc.flags |= cudaExternalMemoryDedicated;

    cudaImportExternalMemory(&extMem, &desc);

    return extMem;
}
3.2.16.4.3. Mapping Buffers onto Imported Memory Objects

A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping must match that specified when creating the mapping using the corresponding Direct3D 11 API. All mapped device pointers must be freed using cudaFree().

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
3.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions, format and number of mip levels must match that specified when creating the mapping using the corresponding Direct3D 11 API. Additionally, if the mipmapped array can be bound as a render target in Direct3D 11, the flag cudaArrayColorAttachment must be set. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert Direct3D 11 parameters into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}

cudaChannelFormatDesc getCudaChannelFormatDescForDxgiFormat(DXGI_FORMAT dxgiFormat)
{
    cudaChannelFormatDesc d;
    memset(&d, 0, sizeof(d));
    switch (dxgiFormat) {
    case DXGI_FORMAT_R8_UINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8_SINT:            d.x = 8;  d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8_UINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8_SINT:          d.x = 8;  d.y = 8;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R8G8B8A8_UINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R8G8B8A8_SINT:      d.x = 8;  d.y = 8;  d.z = 8;  d.w = 8;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16_UINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16_SINT:           d.x = 16; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16_UINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16_SINT:        d.x = 16; d.y = 16; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R16G16B16A16_UINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R16G16B16A16_SINT:  d.x = 16; d.y = 16; d.z = 16; d.w = 16; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_UINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32_SINT:           d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32_FLOAT:          d.x = 32; d.y = 0;  d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32_UINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32_SINT:        d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32_FLOAT:       d.x = 32; d.y = 32; d.z = 0;  d.w = 0;  d.f = cudaChannelFormatKindFloat;    break;
    case DXGI_FORMAT_R32G32B32A32_UINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindUnsigned; break;
    case DXGI_FORMAT_R32G32B32A32_SINT:  d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindSigned;   break;
    case DXGI_FORMAT_R32G32B32A32_FLOAT: d.x = 32; d.y = 32; d.z = 32; d.w = 32; d.f = cudaChannelFormatKindFloat;    break;
    default: assert(0);
    }
    return d;
}

cudaExtent getCudaExtentForD3D11Extent(UINT64 width, UINT height, UINT16 depthOrArraySize, D3D11_SRV_DIMENSION d3d11SRVDimension) {
    cudaExtent e = { 0, 0, 0 };

    switch (d3d11SRVDimension) {
    case D3D11_SRV_DIMENSION_TEXTURE1D:        e.width = width; e.height = 0;      e.depth = 0;                break;
    case D3D11_SRV_DIMENSION_TEXTURE2D:        e.width = width; e.height = height; e.depth = 0;                break;
    case D3D11_SRV_DIMENSION_TEXTURE3D:        e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURECUBE:      e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURE1DARRAY:   e.width = width; e.height = 0;      e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURE2DARRAY:   e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: e.width = width; e.height = height; e.depth = depthOrArraySize; break;
    default: assert(0);
    }
    return e;
}

unsigned int getCudaMipmappedArrayFlagsForD3D11Resource(D3D11_SRV_DIMENSION d3d11SRVDimension, D3D11_BIND_FLAG d3d11BindFlags, bool allowSurfaceLoadStore) {
    unsigned int flags = 0;

    switch (d3d11SRVDimension) {
    case D3D11_SRV_DIMENSION_TEXTURECUBE:      flags |= cudaArrayCubemap;                    break;
    case D3D11_SRV_DIMENSION_TEXTURECUBEARRAY: flags |= cudaArrayCubemap | cudaArrayLayered; break;
    case D3D11_SRV_DIMENSION_TEXTURE1DARRAY:   flags |= cudaArrayLayered;                    break;
    case D3D11_SRV_DIMENSION_TEXTURE2DARRAY:   flags |= cudaArrayLayered;                    break;
    default: break;
    }

    if (d3d11BindFlags & D3D11_BIND_RENDER_TARGET) {
        flags |= cudaArrayColorAttachment;
    }

    if (allowSurfaceLoadStore) {
        flags |= cudaArraySurfaceLoadStore;
    }

    return flags;
}
3.2.16.4.5. Importing Synchronization Objects

A shareable Direct3D 11 fence object, created by setting the flag D3D11_FENCE_FLAG_SHARED in the call to ID3D11Device5::CreateFence, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.

cudaExternalSemaphore_t importD3D11FenceFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 11 fence object can also be imported using a named handle if one exists as shown below.

cudaExternalSemaphore_t importD3D11FenceFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeD3D11Fence;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

A shareable Direct3D 11 keyed mutex object, namely IDXGIKeyedMutex, associated with a shareable Direct3D 11 resource created by setting the flag D3D11_RESOURCE_MISC_SHARED_KEYEDMUTEX, can be imported into CUDA using the NT handle associated with that object as shown below. Note that it is the application’s responsibility to close the handle when it is not required anymore. The NT handle holds a reference to the resource, so it must be explicitly freed before the underlying semaphore can be freed.

cudaExternalSemaphore_t importD3D11KeyedMutexFromNTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Input parameter 'handle' should be closed if it's not needed anymore
    CloseHandle(handle);

    return extSem;
}

A shareable Direct3D 11 keyed mutex object can also be imported using a named handle if one exists as shown below.

cudaExternalSemaphore_t importD3D11KeyedMutexFromNamedNTHandle(LPCWSTR name) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutex;
    desc.handle.win32.name = (void *)name;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}

A shareable Direct3D 11 keyed mutex object can be imported into CUDA using the globally shared D3DKMT handle associated with that object as shown below. Since a globally shared D3DKMT handle does not hold a reference to the underlying memory it is automatically destroyed when all other references to the resource are destroyed.

cudaExternalSemaphore_t importD3D11FenceFromKMTHandle(HANDLE handle) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeKeyedMutexKmt;
    desc.handle.win32.handle = handle;

    cudaImportExternalSemaphore(&extSem, &desc);

    return extSem;
}
3.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects

An imported Direct3D 11 fence object can be signaled as shown below. Signaling such a fence object sets its value to the one specified. The corresponding wait that waits on this signal must be issued in Direct3D 11, and it must be issued after this signal has been issued.

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

An imported Direct3D 11 fence object can be waited on as shown below. Waiting on such a fence object blocks until its value becomes greater than or equal to the specified value. The corresponding signal that this wait is waiting on must be issued in Direct3D 11, and it must be issued before this wait can be issued.

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long value, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.fence.value = value;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

An imported Direct3D 11 keyed mutex object can be signaled as shown below. Signaling such a keyed mutex object with a key value releases the keyed mutex for that value. The corresponding wait that waits on this signal must be issued in Direct3D 11 with the same key value, and it must be issued after this signal has been issued.

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, cudaStream_t stream) {
    cudaExternalSemaphoreSignalParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.keyedmutex.key = key;

    cudaSignalExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

An imported Direct3D 11 keyed mutex object can be waited on as shown below. A timeout value in milliseconds is required when waiting on such a keyed mutex. The wait operation blocks until the keyed mutex value equals the specified key value or until the timeout elapses. The timeout interval can also be infinite; the Windows INFINITE macro must be used to specify an infinite timeout, in which case the timeout never elapses. The corresponding signal that this wait is waiting on must be issued in Direct3D 11, and it must be issued before this wait can be issued.

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, unsigned long long key, unsigned int timeoutMs, cudaStream_t stream) {
    cudaExternalSemaphoreWaitParams params = {};

    memset(&params, 0, sizeof(params));

    params.params.keyedmutex.key = key;
    params.params.keyedmutex.timeoutMs = timeoutMs;

    cudaWaitExternalSemaphoresAsync(&extSem, &params, 1, stream);
}

3.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)

NvSciBuf and NvSciSync are interfaces developed to serve the following purposes:

  • NvSciBuf: Allows applications to allocate and exchange buffers in memory

  • NvSciSync: Allows applications to manage synchronization objects at operation boundaries

More details on these interfaces are available at: https://docs.nvidia.com/drive.

3.2.16.5.1. Importing Memory Objects

To allocate an NvSciBuf object compatible with a given CUDA device, the corresponding GPU ID must be set with NvSciBufGeneralAttrKey_GpuId in the NvSciBuf attribute list as shown below. Optionally, applications can specify the following attributes:

  • NvSciBufGeneralAttrKey_NeedCpuAccess: Specifies if CPU access is required for the buffer

  • NvSciBufRawBufferAttrKey_Align: Specifies the alignment requirement of NvSciBufType_RawBuffer

  • NvSciBufGeneralAttrKey_RequiredPerm: Different access permissions can be configured for different UMDs per NvSciBuf memory object instance. For example, to give the GPU read-only access to the buffer, create a duplicate NvSciBuf object using NvSciBufObjDupWithReducePerm() with NvSciBufAccessPerm_Readonly as the input parameter, then import this newly created duplicate object with reduced permissions into CUDA as shown below.

  • NvSciBufGeneralAttrKey_EnableGpuCache: To control GPU L2 cacheability

  • NvSciBufGeneralAttrKey_EnableGpuCompression: To specify GPU compression

Note

For more details on these attributes and their valid input options, refer to NvSciBuf Documentation.

The following code snippet illustrates their sample usage.

NvSciBufObj createNvSciBufObject() {
    // Raw buffer attributes for CUDA
    NvSciBufType bufType = NvSciBufType_RawBuffer;
    uint64_t rawsize = SIZE;
    uint64_t align = 0;
    bool cpuaccess_flag = true;
    NvSciBufAttrValAccessPerm perm = NvSciBufAccessPerm_ReadWrite;

    NvSciRmGpuId gpuid[1] = {};
    CUuuid uuid;
    cuDeviceGetUuid(&uuid, dev);

    memcpy(&gpuid[0].bytes, &uuid.bytes, sizeof(uuid.bytes));
    // Disable cache on dev
    NvSciBufAttrValGpuCache gpuCache[] = {{gpuid[0], false}};
    NvSciBufAttrValGpuCompression gpuCompression[] = {{gpuid[0], NvSciBufCompressionType_GenericCompressible}};
    // Fill in values
    NvSciBufAttrKeyValuePair rawbuffattrs[] = {
         { NvSciBufGeneralAttrKey_Types, &bufType, sizeof(bufType) },
         { NvSciBufRawBufferAttrKey_Size, &rawsize, sizeof(rawsize) },
         { NvSciBufRawBufferAttrKey_Align, &align, sizeof(align) },
         { NvSciBufGeneralAttrKey_NeedCpuAccess, &cpuaccess_flag, sizeof(cpuaccess_flag) },
         { NvSciBufGeneralAttrKey_RequiredPerm, &perm, sizeof(perm) },
         { NvSciBufGeneralAttrKey_GpuId, &gpuid, sizeof(gpuid) },
         { NvSciBufGeneralAttrKey_EnableGpuCache, &gpuCache, sizeof(gpuCache) },
         { NvSciBufGeneralAttrKey_EnableGpuCompression, &gpuCompression, sizeof(gpuCompression) }
    };

    // Create the attribute list, then set the attributes on it
    NvSciBufAttrListCreate(NvSciBufModule, &attrListBuffer);
    err = NvSciBufAttrListSetAttrs(attrListBuffer, rawbuffattrs,
            sizeof(rawbuffattrs)/sizeof(NvSciBufAttrKeyValuePair));

    // Reconcile and allocate
    NvSciBufAttrListReconcile(&attrListBuffer, 1, &attrListReconciledBuffer,
                       &attrListConflictBuffer);
    NvSciBufObjAlloc(attrListReconciledBuffer, &bufferObjRaw);
    return bufferObjRaw;
}

// To grant the GPU read-only access instead, create a duplicate handle to the
// same memory buffer with reduced permissions and import that object instead:
NvSciBufObj bufferObjRo; // Read-only NvSciBuf memory object
NvSciBufObjDupWithReducePerm(bufferObjRaw, NvSciBufAccessPerm_Readonly, &bufferObjRo);

The allocated NvSciBuf memory object can be imported into CUDA using the NvSciBufObj handle as shown below. The application should query the allocated NvSciBufObj for the attributes required to fill in the CUDA external memory descriptor. Note that the attribute list and NvSciBuf objects should be maintained by the application. If the NvSciBuf object imported into CUDA is also mapped by other drivers, then, based on the NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency output attribute value, the application must use NvSciSync objects (refer to Importing Synchronization Objects) as appropriate barriers to maintain coherence between CUDA and the other drivers.

Note

For more details on how to allocate and maintain NvSciBuf objects refer to NvSciBuf API Documentation.

cudaExternalMemory_t importNvSciBufObject (NvSciBufObj bufferObjRaw) {

    /*************** Query NvSciBuf Object **************/
    NvSciBufAttrKeyValuePair bufattrs[] = {
                { NvSciBufRawBufferAttrKey_Size, NULL, 0 },
                { NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency, NULL, 0 },
                { NvSciBufGeneralAttrKey_EnableGpuCompression, NULL, 0 }
    };
    NvSciBufAttrListGetAttrs(retList, bufattrs,
        sizeof(bufattrs)/sizeof(NvSciBufAttrKeyValuePair));
    ret_size = *(static_cast<const uint64_t*>(bufattrs[0].value));

    // Note: cache and compression are per-GPU attributes, so read the values
    // for a specific GPU by comparing UUIDs
    // Read cacheability granted by NvSciBuf
    int numGpus = bufattrs[1].len / sizeof(NvSciBufAttrValGpuCache);
    NvSciBufAttrValGpuCache *cacheVal = (NvSciBufAttrValGpuCache *)bufattrs[1].value;
    bool ret_cacheVal;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, cacheVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_cacheVal = cacheVal[i].cacheability;
        }
    }

    // Read compression granted by NvSciBuf
    numGpus = bufattrs[2].len / sizeof(NvSciBufAttrValGpuCompression);
    NvSciBufAttrValGpuCompression *compVal = (NvSciBufAttrValGpuCompression *)bufattrs[2].value;
    NvSciBufCompressionType ret_compVal;
    for (int i = 0; i < numGpus; i++) {
        if (memcmp(gpuid[0].bytes, compVal[i].gpuId.bytes, sizeof(CUuuid)) == 0) {
            ret_compVal = compVal[i].compressionType;
        }
    }

    /*************** NvSciBuf Registration With CUDA **************/

    // Fill up CUDA_EXTERNAL_MEMORY_HANDLE_DESC
    cudaExternalMemoryHandleDesc memHandleDesc;
    memset(&memHandleDesc, 0, sizeof(memHandleDesc));
    memHandleDesc.type = cudaExternalMemoryHandleTypeNvSciBuf;
    // Set the NvSciBuf object with the required access permissions in this
    // step; pass a reduced-permission duplicate (e.g. bufferObjRo) instead of
    // bufferObjRaw to grant the GPU read-only access
    memHandleDesc.handle.nvSciBufObject = bufferObjRaw;
    memHandleDesc.size = ret_size;
    cudaImportExternalMemory(&extMemBuffer, &memHandleDesc);
    return extMemBuffer;
}
3.2.16.5.2. Mapping Buffers onto Imported Memory Objects

A device pointer can be mapped onto an imported memory object as shown below. The offset and size of the mapping can be filled as per the attributes of the allocated NvSciBufObj. All mapped device pointers must be freed using cudaFree().

void * mapBufferOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, unsigned long long size) {
    void *ptr = NULL;
    cudaExternalMemoryBufferDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.size = size;

    cudaExternalMemoryGetMappedBuffer(&ptr, extMem, &desc);

    // Note: 'ptr' must eventually be freed using cudaFree()
    return ptr;
}
3.2.16.5.3. Mapping Mipmapped Arrays onto Imported Memory Objects

A CUDA mipmapped array can be mapped onto an imported memory object as shown below. The offset, dimensions and format can be filled as per the attributes of the allocated NvSciBufObj. All mapped mipmapped arrays must be freed using cudaFreeMipmappedArray(). The following code sample shows how to convert NvSciBuf attributes into the corresponding CUDA parameters when mapping mipmapped arrays onto imported memory objects.

Note

The number of mip levels must be 1.

cudaMipmappedArray_t mapMipmappedArrayOntoExternalMemory(cudaExternalMemory_t extMem, unsigned long long offset, cudaChannelFormatDesc *formatDesc, cudaExtent *extent, unsigned int flags, unsigned int numLevels) {
    cudaMipmappedArray_t mipmap = NULL;
    cudaExternalMemoryMipmappedArrayDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.offset = offset;
    desc.formatDesc = *formatDesc;
    desc.extent = *extent;
    desc.flags = flags;
    desc.numLevels = numLevels;

    // Note: 'mipmap' must eventually be freed using cudaFreeMipmappedArray()
    cudaExternalMemoryGetMappedMipmappedArray(&mipmap, extMem, &desc);

    return mipmap;
}
3.2.16.5.4. Importing Synchronization Objects

NvSciSync attributes that are compatible with a given CUDA device can be generated using cudaDeviceGetNvSciSyncAttributes(). The returned attribute list can be used to create an NvSciSyncObj that is guaranteed to be compatible with a given CUDA device.

NvSciSyncObj createNvSciSyncObject() {
    NvSciSyncObj nvSciSyncObj;
    int cudaDev0 = 0;
    int cudaDev1 = 1;
    NvSciSyncAttrList signalerAttrList = NULL;
    NvSciSyncAttrList waiterAttrList = NULL;
    NvSciSyncAttrList reconciledList = NULL;
    NvSciSyncAttrList newConflictList = NULL;

    NvSciSyncAttrListCreate(module, &signalerAttrList);
    NvSciSyncAttrListCreate(module, &waiterAttrList);
    NvSciSyncAttrList unreconciledList[2] = {NULL, NULL};
    unreconciledList[0] = signalerAttrList;
    unreconciledList[1] = waiterAttrList;

    cudaDeviceGetNvSciSyncAttributes(signalerAttrList, cudaDev0, CUDA_NVSCISYNC_ATTR_SIGNAL);
    cudaDeviceGetNvSciSyncAttributes(waiterAttrList, cudaDev1, CUDA_NVSCISYNC_ATTR_WAIT);

    NvSciSyncAttrListReconcile(unreconciledList, 2, &reconciledList, &newConflictList);

    NvSciSyncObjAlloc(reconciledList, &nvSciSyncObj);

    return nvSciSyncObj;
}

An NvSciSync object (created as above) can be imported into CUDA using the NvSciSyncObj handle as shown below. Note that ownership of the NvSciSyncObj handle continues to lie with the application even after it is imported.

cudaExternalSemaphore_t importNvSciSyncObject(void* nvSciSyncObj) {
    cudaExternalSemaphore_t extSem = NULL;
    cudaExternalSemaphoreHandleDesc desc = {};

    memset(&desc, 0, sizeof(desc));

    desc.type = cudaExternalSemaphoreHandleTypeNvSciSync;
    desc.handle.nvSciSyncObj = nvSciSyncObj;

    cudaImportExternalSemaphore(&extSem, &desc);

    // Deleting/Freeing the nvSciSyncObj beyond this point will lead to undefined behavior in CUDA

    return extSem;
}
3.2.16.5.5. Signaling/Waiting on Imported Synchronization Objects

An imported NvSciSyncObj object can be signaled as outlined below. Signaling an NvSciSync-backed semaphore object initializes the fence parameter passed as input. This fence parameter is waited upon by the wait operation that corresponds to this signal, and that wait must be issued after this signal has been issued. If the flags are set to cudaExternalSemaphoreSignalSkipNvSciBufMemSync, then the memory synchronization operations (over all the imported NvSciBuf objects in this process) that are executed as part of the signal operation by default are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.

void signalExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
    cudaExternalSemaphoreSignalParams signalParams = {};

    memset(&signalParams, 0, sizeof(signalParams));

    signalParams.params.nvSciSync.fence = (void*)fence;
    signalParams.flags = 0; //OR cudaExternalSemaphoreSignalSkipNvSciBufMemSync

    cudaSignalExternalSemaphoresAsync(&extSem, &signalParams, 1, stream);

}

An imported NvSciSyncObj object can be waited upon as outlined below. Waiting on an NvSciSync-backed semaphore object blocks until the input fence parameter is signaled by the corresponding signaler, and the signal must be issued before the wait can be issued. If the flags are set to cudaExternalSemaphoreWaitSkipNvSciBufMemSync, then the memory synchronization operations (over all the imported NvSciBuf objects in this process) that are executed as part of the wait operation by default are skipped. This flag should be set when NvSciBufGeneralAttrKey_GpuSwNeedCacheCoherency is FALSE.

void waitExternalSemaphore(cudaExternalSemaphore_t extSem, cudaStream_t stream, void *fence) {
    cudaExternalSemaphoreWaitParams waitParams = {};

    memset(&waitParams, 0, sizeof(waitParams));

    waitParams.params.nvSciSync.fence = (void*)fence;
    waitParams.flags = 0; //OR cudaExternalSemaphoreWaitSkipNvSciBufMemSync

    cudaWaitExternalSemaphoresAsync(&extSem, &waitParams, 1, stream);
}

3.3. Versioning and Compatibility

There are two version numbers that developers should care about when developing a CUDA application: The compute capability that describes the general specifications and features of the compute device (see Compute Capability) and the version of the CUDA driver API that describes the features supported by the driver API and runtime.

The version of the driver API is defined in the driver header file as CUDA_VERSION. It allows developers to check whether their application requires a newer device driver than the one currently installed. This is important, because the driver API is backward compatible, meaning that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will continue to work on subsequent device driver releases, as illustrated in Figure 25. The driver API is not forward compatible, which means that applications, plug-ins, and libraries (including the CUDA runtime) compiled against a particular version of the driver API will not work on previous versions of the device driver.

It is important to note that there are limitations on the mixing and matching of versions that is supported:

  • Since only one version of the CUDA Driver can be installed at a time on a system, the installed driver must be of the same or higher version than the maximum Driver API version against which any application, plug-ins, or libraries that must run on that system were built.

  • All plug-ins and libraries used by an application must use the same version of the CUDA Runtime unless they statically link to the Runtime, in which case multiple versions of the runtime can coexist in the same process space. Note that if nvcc is used to link the application, the static version of the CUDA Runtime library will be used by default, and all CUDA Toolkit libraries are statically linked against the CUDA Runtime.

  • All plug-ins and libraries used by an application must use the same version of any libraries that use the runtime (such as cuFFT, cuBLAS, …) unless statically linking to those libraries.

Figure 25 The Driver API Is Backward but Not Forward Compatible

For Tesla GPU products, CUDA 10 introduced a new forward-compatible upgrade path for the user-mode components of the CUDA Driver. This feature is described in CUDA Compatibility. The requirements on the CUDA Driver version described here apply to the version of the user-mode components.

3.4. Compute Modes

On Tesla solutions running Windows Server 2008 and later or Linux, one can set any device in a system in one of the three following modes using NVIDIA’s System Management Interface (nvidia-smi), which is a tool distributed as part of the driver:

  • Default compute mode: Multiple host threads can use the device at the same time (by calling cudaSetDevice() on this device when using the runtime API, or by making current a context associated with the device when using the driver API).

  • Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system. The context may be current to as many threads as desired within the process that created that context.

  • Prohibited compute mode: No CUDA context can be created on the device.

This means, in particular, that a host thread using the runtime API without explicitly calling cudaSetDevice() might be associated with a device other than device 0 if device 0 turns out to be in prohibited mode or in exclusive-process mode and used by another process. cudaSetValidDevices() can be used to set a device from a prioritized list of devices.

Note also that, for devices featuring the Pascal architecture onwards (compute capability with major revision number 6 and higher), there exists support for Compute Preemption. This allows compute tasks to be preempted at instruction-level granularity, rather than at thread block granularity as in the prior Maxwell and Kepler GPU architectures, with the benefit that applications with long-running kernels can be prevented from either monopolizing the system or timing out. However, there will be context switch overheads associated with Compute Preemption, which is automatically enabled on those devices for which support exists. The individual attribute query function cudaDeviceGetAttribute() with the attribute cudaDevAttrComputePreemptionSupported can be used to determine if the device in use supports Compute Preemption. Users wishing to avoid context switch overheads associated with different processes can ensure that only one process is active on the GPU by selecting exclusive-process mode.

Applications may query the compute mode of a device by checking the computeMode device property (see Device Enumeration).

3.5. Mode Switches

GPUs that have a display output dedicate some DRAM memory to the so-called primary surface, which is used to refresh the display device whose output is viewed by the user. When users initiate a mode switch of the display by changing the resolution or bit depth of the display (using NVIDIA control panel or the Display control panel on Windows), the amount of memory needed for the primary surface changes. For example, if the user changes the display resolution from 1280x1024x32-bit to 1600x1200x32-bit, the system must dedicate 7.68 MB to the primary surface rather than 5.24 MB. (Full-screen graphics applications running with anti-aliasing enabled may require much more display memory for the primary surface.) On Windows, other events that may initiate display mode switches include launching a full-screen DirectX application, hitting Alt+Tab to task switch away from a full-screen DirectX application, or hitting Ctrl+Alt+Del to lock the computer.

If a mode switch increases the amount of memory needed for the primary surface, the system may have to cannibalize memory allocations dedicated to CUDA applications. Therefore, a mode switch causes any call to the CUDA runtime to fail and return an invalid context error.

3.6. Tesla Compute Cluster Mode for Windows

Using NVIDIA’s System Management Interface (nvidia-smi), the Windows device driver can be put in TCC (Tesla Compute Cluster) mode for devices of the Tesla and Quadro Series.

TCC mode removes support for any graphics functionality.

4. Hardware Implementation

The NVIDIA GPU architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread) that is described in SIMT Architecture. The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading as detailed in Hardware Multithreading. Unlike on CPU cores, instructions are issued in order, and there is no branch prediction or speculative execution.

SIMT Architecture and Hardware Multithreading describe the architecture features of the streaming multiprocessor that are common to all devices. Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x provide the specifics for devices of compute capabilities 5.x, 6.x, and 7.x respectively.

The NVIDIA GPU architecture uses a little-endian representation.

4.1. SIMT Architecture

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. A half-warp is either the first or second half of a warp. A quarter-warp is either the first, second, third, or fourth quarter of a warp.

When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0. Thread Hierarchy describes how thread IDs relate to thread indices in the block.

A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.

The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.

Prior to NVIDIA Volta, warps used a single program counter shared amongst all 32 threads in the warp together with an active mask specifying the active threads of the warp. As a result, threads from the same warp in divergent regions or different states of execution cannot signal each other or exchange data, and algorithms requiring fine-grained sharing of data guarded by locks or mutexes can easily lead to deadlock, depending on which warp the contending threads come from.

Starting with the NVIDIA Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp. With Independent Thread Scheduling, the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity, either to make better use of execution resources or to allow one thread to wait for data to be produced by another. A schedule optimizer determines how to group active threads from the same warp together into SIMT units. This retains the high throughput of SIMT execution as in prior NVIDIA GPUs, but with much more flexibility: threads can now diverge and reconverge at sub-warp granularity.

Independent Thread Scheduling can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity2 of previous hardware architectures. In particular, any warp-synchronous code (such as synchronization-free, intra-warp reductions) should be revisited to ensure compatibility with NVIDIA Volta and beyond. See Compute Capability 7.x for further details.
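
For example, an intra-warp reduction written against the legacy, implicitly warp-synchronous shuffle intrinsics should be rewritten with the `*_sync` variants, which take an explicit member mask. A minimal sketch, assuming all 32 lanes participate:

```cuda
// Warp-level sum reduction updated for Independent Thread Scheduling.
// Pre-Volta code often relied on implicit warp synchrony; the *_sync
// intrinsics make the participating threads explicit
// (0xffffffff = all 32 lanes).
__inline__ __device__ int warpReduceSum(int val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp-wide sum
}
```

If only a subset of lanes is active at the call site, the mask must name exactly that subset.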

Note

The threads of a warp that are participating in the current instruction are called the active threads, whereas threads not on the current instruction are inactive (disabled). Threads can be inactive for a variety of reasons including having exited earlier than other threads of their warp, having taken a different branch path than the branch path currently executed by the warp, or being the last threads of a block whose number of threads is not a multiple of the warp size.

If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 5.x, Compute Capability 6.x, and Compute Capability 7.x), and which thread performs the final write is undefined.

If an atomic instruction executed by a warp reads, modifies, and writes to the same location in global memory for more than one of the threads of the warp, each read/modify/write to that location occurs and they are all serialized, but the order in which they occur is undefined.

4.2. Hardware Multithreading

The execution context (program counters, registers, and so on) for each warp processed by a multiprocessor is maintained on-chip during the entire lifetime of the warp. Therefore, switching from one execution context to another has no cost, and at every instruction issue time, a warp scheduler selects a warp that has threads ready to execute its next instruction (the active threads of the warp) and issues the instruction to those threads.

In particular, each multiprocessor has a set of 32-bit registers that are partitioned among the warps, and a parallel data cache or shared memory that is partitioned among the thread blocks.

The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Compute Capabilities. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.

The total number of warps in a block is as follows:

ceil(T / Wsize, 1)

  • T is the number of threads per block,

  • Wsize is the warp size, which is equal to 32,

  • ceil(x, y) is equal to x rounded up to the nearest multiple of y.

The total number of registers and total amount of shared memory allocated for a block are documented in the CUDA Occupancy Calculator provided in the CUDA Toolkit.

2

The term warp-synchronous refers to code that implicitly assumes threads in the same warp are synchronized at every instruction.

5. Performance Guidelines

5.1. Overall Performance Optimization Strategies

Performance optimization revolves around four basic strategies:

  • Maximize parallel execution to achieve maximum utilization;

  • Optimize memory usage to achieve maximum memory throughput;

  • Optimize instruction usage to achieve maximum instruction throughput;

  • Minimize memory thrashing.

Which strategies will yield the best performance gain for a particular portion of an application depends on the performance limiters for that portion; optimizing instruction usage of a kernel that is mostly limited by memory accesses will not yield any significant performance gain, for example. Optimization efforts should therefore be constantly directed by measuring and monitoring the performance limiters, for example using the CUDA profiler. Also, comparing the floating-point operation throughput or memory throughput—whichever makes more sense—of a particular kernel to the corresponding peak theoretical throughput of the device indicates how much room for improvement there is for the kernel.

5.2. Maximize Utilization

To maximize utilization the application should be structured in a way that exposes as much parallelism as possible and efficiently maps this parallelism to the various components of the system to keep them busy most of the time.

5.2.1. Application Level

At a high level, the application should maximize parallel execution between the host, the devices, and the bus connecting the host to the devices, by using asynchronous function calls and streams as described in Asynchronous Concurrent Execution. It should assign to each processor the type of work it does best: serial workloads to the host; parallel workloads to the devices.

For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.

5.2.2. Device Level

At a lower level, the application should maximize parallel execution between the multiprocessors of a device.

Multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently as described in Asynchronous Concurrent Execution.

5.2.3. Multiprocessor Level

At an even lower level, the application should maximize parallel execution between the various functional units within a multiprocessor.

As described in Hardware Multithreading, a GPU multiprocessor primarily relies on thread-level parallelism to maximize utilization of its functional units. Utilization is therefore directly linked to the number of resident warps. At every instruction issue time, a warp scheduler selects an instruction that is ready to execute. This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism, or more commonly an instruction of another warp, exploiting thread-level parallelism. If a ready-to-execute instruction is selected, it is issued to the active threads of the warp. The number of clock cycles it takes for a warp to be ready to execute its next instruction is called the latency, and full utilization is achieved when all warp schedulers always have some instruction to issue for some warp at every clock cycle during that latency period, or in other words, when latency is completely “hidden”. The number of instructions required to hide a latency of L clock cycles depends on the respective throughputs of these instructions (see Arithmetic Instructions for the throughputs of various arithmetic instructions). If we assume instructions with maximum throughput, it is equal to:

  • 4L for devices of compute capability 5.x, 6.1, 6.2, 7.x and 8.x since for these devices, a multiprocessor issues one instruction per warp over one clock cycle for four warps at a time, as mentioned in Compute Capabilities.

  • 2L for devices of compute capability 6.0 since for these devices, the two instructions issued every cycle are one instruction for two different warps.

The most common reason a warp is not ready to execute its next instruction is that the instruction’s input operands are not available yet.

If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time. Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies (assuming that warps execute instructions with maximum throughput, otherwise fewer warps are needed). If the individual warps exhibit instruction-level parallelism, i.e. have multiple independent instructions in their instruction stream, fewer warps are needed because multiple independent instructions from a single warp can be issued back to back.

If some input operand resides in off-chip memory, the latency is much higher: typically hundreds of clock cycles. The number of warps required to keep the warp schedulers busy during such high latency periods depends on the kernel code and its degree of instruction-level parallelism. In general, more warps are required if the ratio of the number of instructions with no off-chip memory operands (i.e., arithmetic instructions most of the time) to the number of instructions with off-chip memory operands is low (this ratio is commonly called the arithmetic intensity of the program).

Another reason a warp is not ready to execute its next instruction is that it is waiting at some memory fence (Memory Fence Functions) or synchronization point (Synchronization Functions). A synchronization point can force the multiprocessor to idle as more and more warps wait for other warps in the same block to complete execution of instructions prior to the synchronization point. Having multiple resident blocks per multiprocessor can help reduce idling in this case, as warps from different blocks do not need to wait for each other at synchronization points.

The number of blocks and warps residing on each multiprocessor for a given kernel call depends on the execution configuration of the call (Execution Configuration), the memory resources of the multiprocessor, and the resource requirements of the kernel as described in Hardware Multithreading. Register and shared memory usage are reported by the compiler when compiling with the --ptxas-options=-v option.

The total amount of shared memory required for a block is equal to the sum of the amount of statically allocated shared memory and the amount of dynamically allocated shared memory.

The number of registers used by a kernel can have a significant impact on the number of resident warps. For example, for devices of compute capability 6.x, if a kernel uses 64 registers and each block has 512 threads and requires very little shared memory, then two blocks (i.e., 32 warps) can reside on the multiprocessor since they require 2x512x64 registers, which exactly matches the number of registers available on the multiprocessor. But as soon as the kernel uses one more register, only one block (i.e., 16 warps) can be resident since two blocks would require 2x512x65 registers, which are more registers than are available on the multiprocessor. Therefore, the compiler attempts to minimize register usage while keeping register spilling (see Device Memory Accesses) and the number of instructions to a minimum. Register usage can be controlled using the maxrregcount compiler option, the __launch_bounds__() qualifier as described in Launch Bounds, or the __maxnreg__() qualifier as described in Maximum Number of Registers per Thread.

The register file is organized as 32-bit registers. So, each variable stored in a register needs at least one 32-bit register, for example, a double variable uses two 32-bit registers.

The effect of execution configuration on performance for a given kernel call generally depends on the kernel code. Experimentation is therefore recommended. Applications can also parametrize execution configurations based on register file size and shared memory size, which depends on the compute capability of the device, as well as on the number of multiprocessors and memory bandwidth of the device, all of which can be queried using the runtime (see reference manual).

The number of threads per block should be chosen as a multiple of the warp size to avoid wasting computing resources with under-populated warps as much as possible.

5.2.3.1. Occupancy Calculator

Several API functions exist to assist programmers in choosing thread block size and cluster size based on register and shared memory requirements.

  • The occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor, can provide an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor.

    • Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

  • The occupancy-based launch configurator APIs, cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxPotentialBlockSizeVariableSMem, heuristically calculate an execution configuration that achieves the maximum multiprocessor-level occupancy.

  • The occupancy calculator API, cudaOccupancyMaxActiveClusters, can provide an occupancy prediction based on the cluster size, block size, and shared memory usage of a kernel. This function reports occupancy in terms of the maximum number of active clusters of a given size on the GPU present in the system.

The following code sample calculates the occupancy of MyKernel. It then reports the occupancy level as the ratio of concurrent warps to the maximum warps per multiprocessor.

#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;        // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}

The following code sample configures an occupancy-based kernel launch of MyKernel according to the user input.

// Device code
__global__ void MyKernel(int *array, int arrayCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayCount) {
        array[idx] *= array[idx];
    }
}

// Host code
int launchMyKernel(int *array, int arrayCount)
{
    int blockSize;      // The launch configurator returned block size
    int minGridSize;    // The minimum grid size needed to achieve the
                        // maximum occupancy for a full device
                        // launch
    int gridSize;       // The actual grid size needed, based on input
                        // size

    cudaOccupancyMaxPotentialBlockSize(
        &minGridSize,
        &blockSize,
        (void*)MyKernel,
        0,
        arrayCount);

    // Round up according to array size
    gridSize = (arrayCount + blockSize - 1) / blockSize;

    MyKernel<<<gridSize, blockSize>>>(array, arrayCount);
    cudaDeviceSynchronize();

    // If interested, the occupancy can be calculated with
    // cudaOccupancyMaxActiveBlocksPerMultiprocessor

    return 0;
}

The following code sample shows how to use the cluster occupancy API to find the maximum number of active clusters of a given size. The example below calculates occupancy for a cluster size of 2 with 128 threads per block.

A cluster size of 8 is forward compatible starting with compute capability 9.0, except on GPU hardware or MIG configurations that are too small to support 8 multiprocessors, in which case the maximum cluster size is reduced. It is recommended that users query the maximum cluster size before launching a cluster kernel; the maximum cluster size can be queried using the cudaOccupancyMaxPotentialClusterSize API.

{
  cudaLaunchConfig_t config = {0};
  config.gridDim = number_of_blocks;
  config.blockDim = 128; // threads_per_block = 128
  config.dynamicSmemBytes = dynamic_shared_memory_size;

  cudaLaunchAttribute attribute[1];
  attribute[0].id = cudaLaunchAttributeClusterDimension;
  attribute[0].val.clusterDim.x = 2; // cluster_size = 2
  attribute[0].val.clusterDim.y = 1;
  attribute[0].val.clusterDim.z = 1;
  config.attrs = attribute;
  config.numAttrs = 1;

  int max_cluster_size = 0;
  cudaOccupancyMaxPotentialClusterSize(&max_cluster_size, (void *)kernel, &config);

  int max_active_clusters = 0;
  cudaOccupancyMaxActiveClusters(&max_active_clusters, (void *)kernel, &config);

  std::cout << "Max Active Clusters of size 2: " << max_active_clusters << std::endl;
}

The CUDA Nsight Compute User Interface also provides a standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. The Nsight Compute version of the occupancy calculator is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread).

5.3. Maximize Memory Throughput

The first step in maximizing overall memory throughput for the application is to minimize data transfers with low bandwidth.

That means minimizing data transfers between the host and the device, as detailed in Data Transfer between Host and Device, since these have much lower bandwidth than data transfers between global memory and the device.

That also means minimizing data transfers between global memory and the device by maximizing use of on-chip memory: shared memory and caches (i.e., L1 cache and L2 cache available on devices of compute capability 2.x and higher, texture cache and constant cache available on all devices).

Shared memory is equivalent to a user-managed cache: The application explicitly allocates and accesses it. As illustrated in CUDA Runtime, a typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block:

  • Load data from device memory to shared memory,

  • Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were populated by different threads,

  • Process the data in shared memory,

  • Synchronize again if necessary to make sure that shared memory has been updated with the results,

  • Write the results back to device memory.
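
The steps above can be sketched as follows (a hypothetical tile-reversal kernel; the kernel name and the assumption of 256 threads per block are illustrative):

```cuda
__global__ void reverseTile(const int *in, int *out)
{
    __shared__ int tile[256];                        // one element per thread
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = in[idx];                     // 1. load from device memory
    __syncthreads();                                 // 2. make all loads visible

    int value = tile[blockDim.x - 1 - threadIdx.x];  // 3. process: read another
                                                     //    thread's element

    // Step 4 (a second __syncthreads()) would be needed only if the tile
    // were written again before being re-read; here each thread just
    // writes its result out.
    out[idx] = value;                                // 5. write back to device memory
}
```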

For some applications (for example, for which global memory access patterns are data-dependent), a traditional hardware-managed cache is more appropriate to exploit data locality. As mentioned in Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0, for devices of compute capability 7.x, 8.x and 9.0, the same on-chip memory is used for both L1 and shared memory, and how much of it is dedicated to L1 versus shared memory is configurable for each kernel call.

The throughput of memory accesses by a kernel can vary by an order of magnitude depending on access pattern for each type of memory. The next step in maximizing memory throughput is therefore to organize memory accesses as optimally as possible based on the optimal memory access patterns described in Device Memory Accesses. This optimization is especially important for global memory accesses as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput, so non-optimal global memory accesses generally have a high impact on performance.

5.3.1. Data Transfer between Host and Device

Applications should strive to minimize data transfer between the host and the device. One way to accomplish this is to move more code from the host to the device, even if that means running kernels that do not expose enough parallelism to execute on the device with full efficiency. Intermediate data structures may be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory.

Also, because of the overhead associated with each transfer, batching many small transfers into a single large transfer always performs better than making each transfer separately.

On systems with a front-side bus, higher performance for data transfers between host and device is achieved by using page-locked host memory as described in Page-Locked Host Memory.

In addition, when using mapped page-locked memory (Mapped Memory), there is no need to allocate any device memory and explicitly copy data between device and host memory. Data transfers are implicitly performed each time the kernel accesses the mapped memory. For maximum performance, these memory accesses must be coalesced as with accesses to global memory (see Device Memory Accesses). Assuming that they are and that the mapped memory is read or written only once, using mapped page-locked memory instead of explicit copies between device and host memory can be a win for performance.
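
A minimal host-side sketch of this pattern (error handling omitted; `myKernel`, `blocks`, and `threads` are placeholders, and the device is assumed to support mapped pinned memory):

```cuda
float *hostPtr, *devPtr;
size_t bytes = 1 << 20;

// Allocate page-locked host memory that is mapped into the device
// address space; no explicit cudaMemcpy is needed afterwards.
cudaHostAlloc(&hostPtr, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

// The kernel accesses the host allocation directly through devPtr;
// transfers happen implicitly on each access.
myKernel<<<blocks, threads>>>(devPtr);

cudaDeviceSynchronize();
cudaFreeHost(hostPtr);
```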

On integrated systems where device memory and host memory are physically the same, any copy between host and device memory is superfluous and mapped page-locked memory should be used instead. Applications may query whether a device is integrated by checking that the integrated device property (see Device Enumeration) is equal to 1.

5.3.2. Device Memory Accesses

An instruction that accesses addressable memory (i.e., global, local, shared, constant, or texture memory) might need to be re-issued multiple times depending on the distribution of the memory addresses across the threads within the warp. How the distribution affects the instruction throughput this way is specific to each type of memory and described in the following sections. For example, for global memory, as a general rule, the more scattered the addresses are, the more reduced the throughput is.

Global Memory

Global memory resides in device memory and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: Only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.

When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, the more transactions are necessary, the more unused words are transferred in addition to the words accessed by the threads, reducing the instruction throughput accordingly. For example, if a 32-byte memory transaction is generated for each thread’s 4-byte access, throughput is divided by 8.
当一个 warp 执行访问全局内存的指令时,它会根据每个线程访问的字大小以及内存地址在线程间的分布,将 warp 内线程的内存访问合并为一个或多个内存事务。一般来说,所需的事务越多,除了线程访问的字之外传输的未使用的字也越多,指令吞吐量会相应降低。例如,如果为每个线程的 4 字节访问都生成一个 32 字节的内存事务,则吞吐量仅为原来的 1/8。

How many transactions are necessary and how much throughput is ultimately affected varies with the compute capability of the device. Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0 give more details on how global memory accesses are handled for various compute capabilities.
设备的计算能力不同,所需的事务数量和最终受影响的吞吐量也会有所不同。计算能力 5.x、计算能力 6.x、计算能力 7.x、计算能力 8.x 和计算能力 9.0 提供了有关如何处理各种计算能力的全局内存访问的更多详细信息。

To maximize global memory throughput, it is therefore important to maximize coalescing by:
为了最大化全局内存吞吐量,因此重要的是通过最大化合并来实现:

  • Following the most optimal access patterns based on Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0
    根据计算能力 5.x、计算能力 6.x、计算能力 7.x、计算能力 8.x 和计算能力 9.0,遵循最优访问模式。

  • Using data types that meet the size and alignment requirement detailed in the section Size and Alignment Requirement below,
    使用满足下面"大小和对齐要求"部分中详述的大小和对齐要求的数据类型,

  • Padding data in some cases, for example, when accessing a two-dimensional array as described in the section Two-Dimensional Arrays below.
    在某些情况下填充数据,例如,在下面描述的二维数组部分访问二维数组时。

Size and Alignment Requirement
大小和对齐要求

Global memory instructions support reading or writing words of size equal to 1, 2, 4, 8, or 16 bytes. Any access (via a variable or a pointer) to data residing in global memory compiles to a single global memory instruction if and only if the size of the data type is 1, 2, 4, 8, or 16 bytes and the data is naturally aligned (i.e., its address is a multiple of that size).
全局内存指令支持读取或写入大小为 1、2、4、8 或 16 字节的字。只有当数据类型的大小为 1、2、4、8 或 16 字节且数据自然对齐(即其地址是该大小的倍数)时,对存储在全局内存中的数据的任何访问(通过变量或指针)都会编译为单个全局内存指令。

If this size and alignment requirement is not fulfilled, the access compiles to multiple instructions with interleaved access patterns that prevent these instructions from fully coalescing. It is therefore recommended to use types that meet this requirement for data that resides in global memory.
如果未满足此大小和对齐要求,则访问将编译为多个指令,其中包含交错访问模式,这些模式会阻止这些指令完全合并。因此建议对驻留在全局内存中的数据使用满足此要求的类型。

The alignment requirement is automatically fulfilled for the Built-in Vector Types.
内置矢量类型的对齐要求会自动得到满足。

For structures, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8) or __align__(16), such as
对于结构体,编译器可以使用对齐说明符 __align__(8) 或 __align__(16) 来强制执行大小和对齐要求,例如

struct __align__(8) {
    float x;
    float y;
};

or 

struct __align__(16) {
    float x;
    float y;
    float z;
};

Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.
任何存储在全局内存中的变量地址,或者由驱动程序或运行时 API 中的内存分配例程返回的地址,都至少按照 256 字节对齐。

Reading non-naturally aligned 8-byte or 16-byte words produces incorrect results (off by a few words), so special care must be taken to maintain alignment of the starting address of any value or array of values of these types. A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block’s starting address.
读取非自然对齐的 8 字节或 16 字节字会产生不正确的结果(偏差几个字),因此必须特别注意保持这些类型的任何值或值数组起始地址的对齐。一个容易忽视这一点的典型情况是使用某种自定义全局内存分配方案:将多个数组的分配(多次调用 cudaMalloc()cuMemAlloc() )替换为分配单个大内存块再划分为多个数组,此时每个数组的起始地址相对于块的起始地址有偏移。

Two-Dimensional Arrays 二维数组

A common global memory access pattern is when each thread of index (tx,ty) uses the following address to access one element of a 2D array of width width, located at address BaseAddress of type type* (where type meets the requirement described in Maximize Utilization):
一个常见的全局内存访问模式是:索引为 (tx,ty) 的每个线程使用以下地址,访问位于地址 BaseAddress、类型为 type* 的宽度为 width 的二维数组中的一个元素(其中 type 满足"最大化利用率"中描述的要求):

BaseAddress + width * ty + tx

For these accesses to be fully coalesced, both the width of the thread block and the width of the array must be a multiple of the warp size.
要使这些访问完全合并,线程块的宽度和数组的宽度都必须是 warp 大小的倍数。

In particular, this means that an array whose width is not a multiple of this size will be accessed much more efficiently if it is actually allocated with a width rounded up to the closest multiple of this size and its rows padded accordingly. The cudaMallocPitch() and cuMemAllocPitch() functions and associated memory copy functions described in the reference manual enable programmers to write non-hardware-dependent code to allocate arrays that conform to these constraints.
特别是,这意味着如果数组的宽度不是此大小的倍数,则如果实际分配的宽度四舍五入为最接近此大小的倍数,并相应地填充其行,则访问效率会大大提高。参考手册中描述的 cudaMallocPitch()cuMemAllocPitch() 函数以及相关的内存复制函数使程序员能够编写符合这些约束条件的数组的分配代码,而不依赖于硬件。
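A minimal sketch of the pitched-allocation pattern described above (the kernel name and block sizes are illustrative, not from this guide):

```cuda
#include <cuda_runtime.h>

// Each thread reads one element of a width x height array of floats.
__global__ void readPitched(float* devPtr, size_t pitch, int width, int height) {
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    int ty = blockIdx.y * blockDim.y + threadIdx.y;
    if (tx < width && ty < height) {
        // Rows are padded to `pitch` bytes, so step between rows in bytes.
        float* row = (float*)((char*)devPtr + ty * pitch);
        float element = row[tx];
        (void)element;
    }
}

int main() {
    float* devPtr;
    size_t pitch;
    int width = 100, height = 64;  // width is not a multiple of the warp size
    // cudaMallocPitch rounds each row up so row starts stay properly aligned.
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    dim3 block(32, 8), grid((width + 31) / 32, (height + 7) / 8);
    readPitched<<<grid, block>>>(devPtr, pitch, width, height);
    cudaDeviceSynchronize();
    cudaFree(devPtr);
    return 0;
}
```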

Local Memory 本地内存

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:
本地内存访问仅适用于某些自动变量,如变量内存空间说明中所述。编译器可能放置在本地内存中的自动变量包括:

  • Arrays for which it cannot determine that they are indexed with constant quantities,
    无法确定它们是否使用常量数量进行索引的数组

  • Large structures or arrays that would consume too much register space,
    会占用太多寄存器空间的大型结构或数组

  • Any variable if the kernel uses more registers than available (this is also known as register spilling).
    如果内核使用的寄存器多于可用的寄存器(这也被称为寄存器溢出),则任何变量。

Inspection of the PTX assembly code (obtained by compiling with the -ptx or-keep option) will tell if a variable has been placed in local memory during the first compilation phases as it will be declared using the .local mnemonic and accessed using the ld.local and st.local mnemonics. Even if it has not, subsequent compilation phases might still decide otherwise though if they find it consumes too much register space for the targeted architecture: Inspection of the cubin object using cuobjdump will tell if this is the case. Also, the compiler reports total local memory usage per kernel (lmem) when compiling with the --ptxas-options=-v option. Note that some mathematical functions have implementation paths that might access local memory.
通过使用 -ptx-keep 选项编译获得的 PTX 汇编代码的检查将告诉您变量是否在第一次编译阶段被放置在本地内存中,因为它将使用 .local 助记符声明并使用 ld.localst.local 助记符访问。即使没有,随后的编译阶段可能仍会做出不同的决定,如果它们发现它对于目标架构消耗了太多寄存器空间:使用 cuobjdump 检查 cubin 对象将告诉您是否存在这种情况。此外,编译时使用 --ptxas-options=-v 选项时,编译器会报告每个内核的总本地内存使用量( lmem )。请注意,一些数学函数具有可能访问本地内存的实现路径。

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as described in Device Memory Accesses. Local memory is however organized such that consecutive 32-bit words are accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all threads in a warp access the same relative address (for example, same index in an array variable, same member in a structure variable).
本地内存空间位于设备内存中,因此本地内存访问具有与全局内存访问相同的高延迟和低带宽,并且受到与设备内存访问中描述的内存合并要求相同的限制。 本地内存的组织方式是,连续的 32 位字由连续的线程 ID 访问。 只要 warp 中的所有线程访问相同的相对地址(例如,数组变量中的相同索引,结构变量中的相同成员),访问就是完全合并的。

On devices of compute capability 5.x onwards, local memory accesses are always cached in L2 in the same way as global memory accesses (see Compute Capability 5.x and Compute Capability 6.x).
在计算能力为 5.x 及更高版本的设备上,本地内存访问始终以与全局内存访问相同的方式在 L2 中进行缓存(请参阅计算能力 5.x 和计算能力 6.x)。

Shared Memory 共享内存

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
由于位于芯片上,共享内存的带宽比本地或全局内存高得多,延迟也低得多。

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
为了实现高带宽,共享内存被划分为大小相等、可以同时访问的内存模块,称为 bank。因此,任何由落在 n 个不同 bank 中的 n 个地址组成的内存读取或写入请求都可以同时得到服务,从而使总带宽达到单个模块带宽的 n 倍。

However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
然而,如果内存请求的两个地址落在同一个 bank 中,就会发生 bank 冲突,访问必须被串行化。硬件会将存在 bank 冲突的内存请求拆分为所需数量的无冲突独立请求,吞吐量会按独立请求的数量成比例下降。如果独立内存请求的数量为 n,则称初始内存请求引起 n 路 bank 冲突。

To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts. This is described in Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x, and Compute Capability 9.0 for devices of compute capability 5.x, 6.x, 7.x, 8.x, and 9.0 respectively.
因此,为了获得最佳性能,重要的是了解内存地址如何映射到 bank,以便调度内存请求,从而最小化 bank 冲突。对于计算能力为 5.x、6.x、7.x、8.x 和 9.0 的设备,这分别在计算能力 5.x、计算能力 6.x、计算能力 7.x、计算能力 8.x 和计算能力 9.0 中进行了描述。
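One common application of this rule is padding a shared-memory tile so that column-wise reads fall in distinct banks; a sketch (the TILE size and kernel name are illustrative):

```cuda
#define TILE 32

__global__ void transposeTile(const float* in, float* out, int n) {
    // The +1 padding shifts each row by one bank, so the column read
    // below touches 32 distinct banks instead of causing a 32-way conflict.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  // column read
}
```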

Constant Memory 常量内存

The constant memory space resides in device memory and is cached in the constant cache.
常量内存空间位于设备内存中,并缓存在常量缓存中。

A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests.
随后,请求会被拆分为与初始请求中不同内存地址数量相同的独立请求,吞吐量按独立请求的数量成比例下降。

The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
如果缓存命中,生成的请求将以常量缓存的吞吐量得到服务,否则以设备内存的吞吐量得到服务。

Texture and Surface Memory
纹理和表面内存

The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.
纹理和表面内存空间驻留在设备内存中,并缓存在纹理缓存中,因此在缓存未命中时,纹理获取或表面读取仅会从设备内存中读取一次内存,否则只会从纹理缓存中读取一次。纹理缓存针对 2D 空间局部性进行了优化,因此读取 2D 空间中彼此接近的纹理或表面地址的同一 warp 的线程将获得最佳性能。此外,它设计用于具有恒定延迟的流式获取;缓存命中会减少 DRAM 带宽需求,但不会减少获取延迟。

Reading device memory through texture or surface fetching present some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
通过纹理或表面获取读取设备内存具有一些优点,这使得它成为从全局或常量内存读取设备内存的有利替代方案:

  • If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads;
    如果内存读取不遵循全局或常量内存读取必须遵循的访问模式以获得良好性能,那么可以实现更高的带宽,前提是纹理获取或表面读取中存在局部性;

  • Addressing calculations are performed outside the kernel by dedicated units;
    专用单元在内核外执行寻址计算;

  • Packed data may be broadcast to separate variables in a single operation;
    打包数据可以在单个操作中广播到单独的变量;

  • 8-bit and 16-bit integer input data may be optionally converted to 32 bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
    8 位和 16 位整数输入数据可以选择性地转换为范围为[0.0, 1.0]或[-1.0, 1.0]的 32 位浮点值(请参阅纹理内存)。

5.4. Maximize Instruction Throughput
5.4. 最大化指令吞吐量 

To maximize instruction throughput the application should:
为了最大化指令吞吐量,应用程序应该:

  • Minimize the use of arithmetic instructions with low throughput; this includes trading precision for speed when it does not affect the end result, such as using intrinsic instead of regular functions (intrinsic functions are listed in Intrinsic Functions), single-precision instead of double-precision, or flushing denormalized numbers to zero;
    尽量减少使用吞吐量低的算术指令;这包括在不影响最终结果的情况下,通过使用内部函数而不是常规函数(内部函数列在内部函数中)、单精度而不是双精度,或将非规格化数值刷新为零来交换精度以提高速度;

  • Minimize divergent warps caused by control flow instructions as detailed in Control Flow Instructions
    尽量减少由控制流指令引起的发散线程束,详见控制流指令

  • Reduce the number of instructions, for example, by optimizing out synchronization points whenever possible as described in Synchronization Instruction or by using restricted pointers as described in __restrict__.
    减少指令数量,例如,通过优化尽可能消除同步点,如同步指令中所述,或者使用如__restrict__中所述的受限指针。

In this section, throughputs are given in number of operations per clock cycle per multiprocessor. For a warp size of 32, one instruction corresponds to 32 operations, so if N is the number of operations per clock cycle, the instruction throughput is N/32 instructions per clock cycle.
在本节中,吞吐量以每个多处理器每个时钟周期的操作数量给出。对于 warp 大小为 32,一个指令对应于 32 个操作,因此如果 N 是每个时钟周期的操作数量,则指令吞吐量为 N/32 个指令每个时钟周期。

All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get throughput for the whole device.
所有吞吐量均针对一个多处理器。必须将其乘以设备中的多处理器数量,以获得整个设备的吞吐量。

5.4.1. Arithmetic Instructions
5.4.1. 算术指令 

The following table gives the throughputs of the arithmetic instructions that are natively supported in hardware for devices of various compute capabilities.
下表列出了各种计算能力设备中硬件原生支持的算术指令的吞吐量。

Table 4 Throughput of Native Arithmetic Instructions. (Number of Results per Clock Cycle per Multiprocessor)
表 4 本机算术指令吞吐量。 (每个多处理器每个时钟周期的结果数) 

| Instruction \ Compute Capability | 5.0, 5.2 | 5.3 | 6.0 | 6.1 | 6.2 | 7.x | 8.0 | 8.6 | 8.9 | 9.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| 16-bit floating-point add, multiply, multiply-add | N/A | 256 | 128 | 2 | 256 | 128 | 256 [3] | 128 | 128 | 256 |
| 32-bit floating-point add, multiply, multiply-add | 128 | 128 | 64 | 128 | 128 | 64 | 64 | 128 | 128 | 128 |
| 64-bit floating-point add, multiply, multiply-add | 4 | 4 | 32 | 4 | 4 | 32 [5] | 32 | 2 | 2 | 64 |
| 32-bit floating-point reciprocal, reciprocal square root, base-2 logarithm (__log2f), base 2 exponential (exp2f), sine (__sinf), cosine (__cosf) | 32 | 32 | 16 | 32 | 32 | 16 | 16 | 16 | 16 | 16 |
| 32-bit integer add, extended-precision add, subtract, extended-precision subtract | 128 | 128 | 64 | 128 | 128 | 64 | 64 | 64 | 64 | 64 |
| 32-bit integer multiply, multiply-add, extended-precision multiply-add | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | 64 [6] | 64 [6] | 64 [6] | 64 [6] | 64 [6] |
| 24-bit integer multiply (__[u]mul24) | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. |
| 32-bit integer shift | 64 | 64 | 32 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| compare, minimum, maximum | 64 | 64 | 32 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| 32-bit integer bit reverse | 64 | 64 | 32 | 64 | 64 | 16 | 16 | 16 | 16 | 16 |
| Bit field extract/insert | 64 | 64 | 32 | 64 | 64 | Multiple instruct. | 64 | 64 | 64 | 64 |
| 32-bit bitwise AND, OR, XOR | 128 | 128 | 64 | 128 | 128 | 64 | 64 | 64 | 64 | 64 |
| count of leading zeros, most significant non-sign bit | 32 | 32 | 16 | 32 | 32 | 16 | 16 | 16 | 16 | 16 |
| population count | 32 | 32 | 16 | 32 | 32 | 16 | 16 | 16 | 16 | 16 |
| warp shuffle | 32 | 32 | 32 | 32 | 32 | 32 [8] | 32 | 32 | 32 | 32 |
| warp reduce | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | 16 | 16 | 16 | 16 |
| warp vote | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| sum of absolute difference | 64 | 64 | 32 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| SIMD video instructions vabsdiff2 | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. |
| SIMD video instructions vabsdiff4 | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | 64 | 64 | 64 | 64 | 64 |
| All other SIMD video instructions | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. |
| Type conversions from 8-bit and 16-bit integer to 32-bit integer types | 32 | 32 | 16 | 32 | 32 | 32 | 32 | 32 | 32 | 64 |
| Type conversions from and to 64-bit types | 4 | 4 | 16 | 4 | 4 | 16 [10] | 16 | 2 | 2 | 16 |
| All other type conversions | 32 | 32 | 16 | 32 | 32 | 16 | 16 | 16 | 16 | 16 |
| 16-bit DPX | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | 128 |
| 32-bit DPX | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | Multiple instruct. | 64 |

Bracketed numbers refer to the numbered footnotes listed before Section 6.
方括号中的数字指第 6 章之前列出的编号脚注。

Other instructions and functions are implemented on top of the native instructions. The implementation may be different for devices of different compute capabilities, and the number of native instructions after compilation may fluctuate with every compiler version. For complicated functions, there can be multiple code paths depending on input. cuobjdump can be used to inspect a particular implementation in a cubin object.
其他指令和函数是在本机指令的基础上实现的。对于不同计算能力的设备,实现可能不同,并且编译后的本机指令数量可能随每个编译器版本而波动。对于复杂的函数,根据输入可能存在多条代码路径。 cuobjdump 可用于检查 cubin 对象中的特定实现。

The implementation of some functions are readily available on the CUDA header files (math_functions.h, device_functions.h, …).
一些函数的实现已经在 CUDA 头文件中准备就绪( math_functions.hdevice_functions.h ,...)。

In general, code compiled with -ftz=true (denormalized numbers are flushed to zero) tends to have higher performance than code compiled with -ftz=false. Similarly, code compiled with -prec-div=false (less precise division) tends to have higher performance code than code compiled with -prec-div=true, and code compiled with -prec-sqrt=false (less precise square root) tends to have higher performance than code compiled with -prec-sqrt=true. The nvcc user manual describes these compilation flags in more details.
通常,使用 -ftz=true 编译的代码(非规范化数字被刷新为零)往往比使用 -ftz=false 编译的代码性能更高。同样,使用 -prec-div=false 编译的代码(较不精确的除法)往往比使用 -prec-div=true 编译的代码性能更高,而使用 -prec-sqrt=false 编译的代码(较不精确的平方根)往往比使用 -prec-sqrt=true 编译的代码性能更高。nvcc 用户手册详细描述了这些编译标志。

Single-Precision Floating-Point Division
单精度浮点数除法

__fdividef(x, y) (see Intrinsic Functions) provides faster single-precision floating-point division than the division operator.
__fdividef(x, y) (请参阅内在函数)提供比除法运算符更快的单精度浮点除法。

Single-Precision Floating-Point Reciprocal Square Root
单精度浮点数倒数平方根

To preserve IEEE-754 semantics the compiler can optimize 1.0/sqrtf() into rsqrtf() only when both reciprocal and square root are approximate, (i.e., with -prec-div=false and -prec-sqrt=false). It is therefore recommended to invoke rsqrtf() directly where desired.
为了保持 IEEE-754 语义,当倒数和平方根都是近似值(即 -prec-div=false-prec-sqrt=false )时,编译器可以将 1.0/sqrtf() 优化为 rsqrtf() 。因此建议在需要时直接调用 rsqrtf()

Single-Precision Floating-Point Square Root
单精度浮点平方根

Single-precision floating-point square root is implemented as a reciprocal square root followed by a reciprocal instead of a reciprocal square root followed by a multiplication so that it gives correct results for 0 and infinity.
单精度浮点平方根被实现为先取倒数平方根再取倒数,而不是先取倒数平方根再做乘法,这样才能对 0 和无穷大给出正确结果。

Sine and Cosine 正弦和余弦

sinf(x), cosf(x), tanf(x), sincosf(x), and corresponding double-precision instructions are much more expensive and even more so if the argument x is large in magnitude.
sinf(x)cosf(x)tanf(x)sincosf(x) 以及相应的双精度函数的开销要大得多,如果参数 x 的数量级很大,开销会更大。

More precisely, the argument reduction code (see Mathematical Functions for implementation) comprises two code paths referred to as the fast path and the slow path, respectively.
更准确地说,参数缩减代码(请参阅实现的数学函数)包括两个分别称为快速路径和慢速路径的代码路径。

The fast path is used for arguments sufficiently small in magnitude and essentially consists of a few multiply-add operations. The slow path is used for arguments large in magnitude and consists of lengthy computations required to achieve correct results over the entire argument range.
快速路径用于幅度足够小的参数,基本上由几个乘加操作组成。慢速路径用于幅度较大的参数,并包括需要进行漫长计算以获得整个参数范围上正确结果的计算。

At present, the argument reduction code for the trigonometric functions selects the fast path for arguments whose magnitude is less than 105615.0f for the single-precision functions, and less than 2147483648.0 for the double-precision functions.
目前,三角函数的参数缩减代码选择快速路径,对于幅度小于 105615.0f 的单精度函数,以及幅度小于 2147483648.0 的双精度函数。

As the slow path requires more registers than the fast path, an attempt has been made to reduce register pressure in the slow path by storing some intermediate variables in local memory, which may affect performance because of local memory high latency and bandwidth (see Device Memory Accesses). At present, 28 bytes of local memory are used by single-precision functions, and 44 bytes are used by double-precision functions. However, the exact amount is subject to change.
由于慢路径需要比快路径更多的寄存器,因此尝试通过将一些中间变量存储在本地内存中来减少慢路径中的寄存器压力,这可能会影响性能,因为本地内存具有较高的延迟和带宽(请参阅设备内存访问)。目前,单精度函数使用 28 字节本地内存,双精度函数使用 44 字节本地内存。但是,确切的数量可能会发生变化。

Due to the lengthy computations and use of local memory in the slow path, the throughput of these trigonometric functions is lower by one order of magnitude when the slow path reduction is required as opposed to the fast path reduction.
由于慢路径计算冗长且使用本地内存,当需要慢路径归约(而非快路径归约)时,这些三角函数的吞吐量会降低一个数量级。

Integer Arithmetic 整数算术

Integer division and modulo operation are costly as they compile to up to 20 instructions. They can be replaced with bitwise operations in some cases: If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
整数除法和取模运算成本高,因为它们编译成多达 20 条指令。在某些情况下,它们可以被位操作替代:如果 n 是 2 的幂,则 ( i/n ) 等同于 (i>>log2(n))(i%n) 等同于 ( i&(n-1) );如果 n 是字面值,编译器将执行这些转换。

__brev and __popc map to a single instruction and __brevll and __popcll to a few instructions.
__brev__popc 映射到一个指令,而 __brevll__popcll 映射到几个指令。

__[u]mul24 are legacy intrinsic functions that no longer have any reason to be used.
__[u]mul24 是过时的内置函数,不再有任何理由使用。

Half Precision Arithmetic
半精度算术

In order to achieve good performance for 16-bit precision floating-point add, multiply or multiply-add, it is recommended that the half2 datatype is used for half precision and __nv_bfloat162 be used for __nv_bfloat16 precision. Vector intrinsics (for example, __hadd2, __hsub2, __hmul2, __hfma2) can then be used to do two operations in a single instruction. Using half2 or __nv_bfloat162 in place of two calls using half or __nv_bfloat16 may also help performance of other intrinsics, such as warp shuffles.
为了实现 16 位精度浮点加法、乘法或乘加法的良好性能,建议使用 half2 数据类型来实现 half 精度,使用 __nv_bfloat162 来实现 __nv_bfloat16 精度。然后可以使用矢量内联函数(例如, __hadd2__hsub2__hmul2__hfma2 )在单个指令中执行两个操作。在其他内联函数的性能方面,例如 warp shuffles,使用 half2__nv_bfloat162 替代两次调用 half__nv_bfloat16 也可能有所帮助。
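A sketch of such a vectorized kernel (the kernel name is illustrative; n2 is the number of half2 elements, i.e. half the number of half values):

```cuda
#include <cuda_fp16.h>

// Each thread adds two half values at once via the half2 type.
__global__ void addHalf2(const half2* a, const half2* b, half2* c, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        c[i] = __hadd2(a[i], b[i]);  // two 16-bit adds in one instruction
}
```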

The intrinsic __halves2half2 is provided to convert two half precision values to the half2 datatype.
提供了内在函数 __halves2half2 ,用于将两个 half 精度值转换为 half2 数据类型。

The intrinsic __halves2bfloat162 is provided to convert two __nv_bfloat16 precision values to the __nv_bfloat162 datatype.
提供了内在函数 __halves2bfloat162 ,用于将两个 __nv_bfloat16 精度值转换为 __nv_bfloat162 数据类型。

Type Conversion 类型转换

Sometimes, the compiler must insert conversion instructions, introducing additional execution cycles. This is the case for:
有时,编译器必须插入转换指令,引入额外的执行周期。这种情况包括:

  • Functions operating on variables of type char or short whose operands generally need to be converted to int,
    操作类型为 charshort 的变量的函数通常需要将其操作数转换为 int

  • Double-precision floating-point constants (i.e., those constants defined without any type suffix) used as input to single-precision floating-point computations (as mandated by C/C++ standards).
    双精度浮点常量(即,那些没有任何类型后缀定义的常量)用作单精度浮点计算的输入(根据 C/C++标准规定)。

This last case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f.
最后一种情况可以通过使用单精度浮点常量来避免,这些常量使用 f 后缀定义,例如 3.141592653589793f1.0f0.5f

5.4.2. Control Flow Instructions
5.4.2. 控制流指令 

Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge (i.e., to follow different execution paths). If this happens, the different executions paths have to be serialized, increasing the total number of instructions executed for this warp.
任何流程控制指令( ifswitchdoforwhile )都可能通过导致同一 warp 的线程分歧(即,遵循不同的执行路径)来显著影响有效指令吞吐量。如果发生这种情况,不同的执行路径必须被串行化,增加了该 warp 执行的总指令数。

To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in SIMT Architecture. A trivial example is when the controlling condition only depends on (threadIdx / warpSize) where warpSize is the warp size. In this case, no warp diverges since the controlling condition is perfectly aligned with the warps.
为了在控制流取决于线程 ID 的情况下获得最佳性能,应该编写控制条件以最小化分歧 warp 的数量。这是可能的,因为 warp 在块内的分布是确定性的,如 SIMT 架构中所述。一个简单的例子是当控制条件仅依赖于( threadIdx / warpSize )其中 warpSize 是 warp 大小。在这种情况下,没有 warp 分歧,因为控制条件与 warp 完全对齐。
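The two cases can be contrasted in a sketch (kernel name is illustrative):

```cuda
__global__ void branchExamples(float* data) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different paths.
    if (threadIdx.x % 2 == 0)
        data[tid] += 1.0f;

    // Non-divergent: the condition is uniform within each warp, because
    // all 32 threads of a warp share the same (threadIdx.x / warpSize).
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[tid] *= 2.0f;
}
```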

Sometimes, the compiler may unroll loops or it may optimize out short if or switch blocks by using branch predication instead, as detailed below. In these cases, no warp can ever diverge. The programmer can also control loop unrolling using the #pragma unroll directive (see #pragma unroll).
有时,编译器可能会展开循环,或者通过使用分支预测来优化短 ifswitch 块,如下所述。在这些情况下,没有 warp 会发生分歧。程序员还可以使用 #pragma unroll 指令来控制循环展开(请参阅#pragma unroll)。

When using branch predication none of the instructions whose execution depends on the controlling condition gets skipped. Instead, each of them is associated with a per-thread condition code or predicate that is set to true or false based on the controlling condition and although each of these instructions gets scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and also do not evaluate addresses or read operands.
在使用分支预测时,取决于控制条件执行的指令都不会被跳过。相反,每个指令都与一个每个线程的条件码或谓词相关联,根据控制条件设置为 true 或 false,尽管每个指令都被调度执行,但只有具有 true 谓词的指令才会实际执行。具有 false 谓词的指令不会写入结果,也不会评估地址或读取操作数。

5.4.3. Synchronization Instruction
5.4.3. 同步指令 

Throughput for __syncthreads() is 32 operations per clock cycle for devices of compute capability 6.0, 16 operations per clock cycle for devices of compute capability 7.x as well as 8.x and 64 operations per clock cycle for devices of compute capability 5.x, 6.1 and 6.2.
对于 __syncthreads() ,计算能力为 6.0 的设备每个时钟周期执行 32 个操作,计算能力为 7.x、8.x 的设备每个时钟周期执行 16 个操作,计算能力为 5.x、6.1 和 6.2 的设备每个时钟周期执行 64 个操作。

Note that __syncthreads() can impact performance by forcing the multiprocessor to idle as detailed in Device Memory Accesses.
请注意, __syncthreads() 可能会影响性能,因为它会强制多处理器处于空闲状态,详细信息请参阅设备内存访问。

5.5. Minimize Memory Thrashing
5.5. 最小化内存抖动 

Applications that allocate and free memory too frequently may find that the allocation calls tend to get slower over time up to a limit. This is typically expected due to the nature of releasing memory back to the operating system for its own use. For best performance in this regard, we recommend the following:
频繁分配和释放内存的应用程序可能会发现,分配调用会随着时间的推移变得越来越慢,直至达到某个上限。这通常是意料之中的,因为释放的内存要交还给操作系统供其自身使用。为了在这方面获得最佳性能,我们建议以下做法:

  • Try to size your allocation to the problem at hand. Don’t try to allocate all available memory with cudaMalloc / cudaMallocHost / cuMemCreate, as this forces memory to be resident immediately and prevents other applications from being able to use that memory. This can put more pressure on operating system schedulers, or just prevent other applications using the same GPU from running entirely.
    尝试根据手头的问题大小调整分配。不要尝试使用 cudaMalloc / cudaMallocHost / cuMemCreate 分配所有可用内存,因为这会强制内存立即常驻,并阻止其他应用程序能够使用该内存。这可能会给操作系统调度程序带来更大压力,或者完全阻止其他使用同一 GPU 的应用程序运行。

  • Try to allocate memory in appropriately sized allocations early in the application and allocations only when the application does not have any use for it. Reduce the number of cudaMalloc+cudaFree calls in the application, especially in performance-critical regions.
    尝试在应用程序早期以适当大小的分配内存,并仅在应用程序不再需要时进行分配。减少应用程序中 cudaMalloc + cudaFree 调用的次数,特别是在性能关键区域。

  • If an application cannot allocate enough device memory, consider falling back on other memory types such as cudaMallocHost or cudaMallocManaged, which may not be as performant, but will enable the application to make progress.
    如果应用程序无法分配足够的设备内存,请考虑退而求其次使用其他内存类型,例如 cudaMallocHostcudaMallocManaged ,这可能不够高效,但可以使应用程序继续运行。

  • For platforms that support the feature, cudaMallocManaged allows for oversubscription, and with the correct cudaMemAdvise policies enabled, will allow the application to retain most if not all the performance of cudaMalloc. cudaMallocManaged also won’t force an allocation to be resident until it is needed or prefetched, reducing the overall pressure on the operating system schedulers and better enabling multi-tenant use cases.
    对于支持该功能的平台, cudaMallocManaged 允许超额订阅,并且在启用正确的 cudaMemAdvise 策略后,应用程序可以保留 cudaMalloc 的大部分甚至全部性能。 cudaMallocManaged 还不会强制分配常驻内存,直到需要或预取它为止,从而减少对操作系统调度程序的整体压力,并更好地支持多租户用例。
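The managed-memory pattern from the last bullet can be sketched as follows (device index, size, and the chosen advice are illustrative):

```cuda
#include <cuda_runtime.h>

int main() {
    int device = 0;
    size_t bytes = 64 << 20;  // 64 MiB
    float* data;
    cudaMallocManaged(&data, bytes);

    // Establish a preferred location for the allocation, then prefetch
    // it so first-touch page faults don't land on the GPU.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, bytes, device);

    // ... launch kernels using `data` ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```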

3: 128 for __nv_bfloat16 / 对 __nv_bfloat16 为 128
4: 8 for GeForce GPUs, except for Titan GPUs / 对 GeForce GPU 为 8(Titan GPU 除外)
5: 2 for compute capability 7.5 GPUs / 对计算能力 7.5 的 GPU 为 2
6: 32 for extended-precision / 扩展精度为 32
7: 32 for GeForce GPUs, except for Titan GPUs / 对 GeForce GPU 为 32(Titan GPU 除外)
8: 16 for compute capability 7.5 GPUs / 对计算能力 7.5 的 GPU 为 16
9: 8 for GeForce GPUs, except for Titan GPUs / 对 GeForce GPU 为 8(Titan GPU 除外)
10: 2 for compute capability 7.5 GPUs / 对计算能力 7.5 的 GPU 为 2

6. CUDA-Enabled GPUs
6. 支持 CUDA 的 GPU 

https://developer.nvidia.com/cuda-gpus lists all CUDA-enabled devices with their compute capability.
https://developer.nvidia.com/cuda-gpus 列出了所有支持 CUDA 的设备及其计算能力。

The compute capability, number of multiprocessors, clock frequency, total amount of device memory, and other properties can be queried using the runtime (see reference manual).
可以使用运行时查询计算能力、多处理器数量、时钟频率、设备内存总量和其他属性(请参阅参考手册)。

7. C++ Language Extensions
7. C++ 语言扩展 

7.1. Function Execution Space Specifiers
7.1. 函数执行空间说明符 

Function execution space specifiers denote whether a function executes on the host or on the device and whether it is callable from the host or from the device.
函数执行空间修饰符表示函数是在主机上执行还是在设备上执行,以及它是可以从主机调用还是从设备调用。

7.1.1. __global__

The __global__ execution space specifier declares a function as being a kernel. Such a function is:
__global__ 执行空间说明符声明函数为内核。这样的函数是:

  • Executed on the device, 在设备上执行,

  • Callable from the host, 可从主机调用,

  • Callable from the device for devices of compute capability 5.0 or higher (see CUDA Dynamic Parallelism for more details).
    对于计算能力为 5.0 或更高的设备,可以从设备调用(请参阅 CUDA 动态并行性以获取更多详细信息)。

A __global__ function must have void return type, and cannot be a member of a class.
一个 __global__ 函数必须具有 void 返回类型,并且不能是类的成员。

Any call to a __global__ function must specify its execution configuration as described in Execution Configuration.
任何对 __global__ 函数的调用都必须根据执行配置中的描述指定其执行配置。

A call to a __global__ function is asynchronous, meaning it returns before the device has completed its execution.
调用 __global__ 函数是异步的,这意味着它在设备完成执行之前返回。
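For example, a minimal launch might look like this (kernel and sizes are illustrative); the cudaDeviceSynchronize() call is what makes the asynchrony visible:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main() {
    int n = 1 << 20;
    float* v;
    cudaMalloc(&v, n * sizeof(float));
    // The <<<grid, block>>> execution configuration is required for any
    // __global__ call; the launch itself returns immediately to the host.
    scale<<<(n + 255) / 256, 256>>>(v, 2.0f, n);
    cudaDeviceSynchronize();  // block the host until the kernel finishes
    cudaFree(v);
    return 0;
}
```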

7.1.2. __device__

The __device__ execution space specifier declares a function that is:
__device__ 执行空间说明符声明一个函数,该函数是:

  • Executed on the device, 在设备上执行,

  • Callable from the device only.
    仅可从设备调用。

The __global__ and __device__ execution space specifiers cannot be used together.
__global____device__ 执行空间说明符不能同时使用。

7.1.3. __host__

The __host__ execution space specifier declares a function that is:

  • Executed on the host,

  • Callable from the host only.

Declaring a function with only the __host__ execution space specifier is equivalent to declaring it without any of the __host__, __device__, or __global__ execution space specifiers; in either case the function is compiled for the host only.

The __global__ and __host__ execution space specifiers cannot be used together.

The __device__ and __host__ execution space specifiers can, however, be used together, in which case the function is compiled for both the host and the device. The __CUDA_ARCH__ macro introduced in Application Compatibility can be used to differentiate code paths between host and device:

__host__ __device__ void func()
{
#if __CUDA_ARCH__ >= 800
   // Device code path for compute capability 8.x
#elif __CUDA_ARCH__ >= 700
   // Device code path for compute capability 7.x
#elif __CUDA_ARCH__ >= 600
   // Device code path for compute capability 6.x
#elif __CUDA_ARCH__ >= 500
   // Device code path for compute capability 5.x
#elif !defined(__CUDA_ARCH__)
   // Host code path
#endif
}

7.1.4. Undefined behavior

A ‘cross-execution space’ call has undefined behavior when:

  • __CUDA_ARCH__ is defined, for a call from within a __global__, __device__, or __host__ __device__ function to a __host__ function.

  • __CUDA_ARCH__ is undefined, for a call from within a __host__ function to a __device__ function. 9

7.1.5. __noinline__ and __forceinline__

The compiler inlines any __device__ function when deemed appropriate.

The __noinline__ function qualifier can be used as a hint for the compiler not to inline the function if possible.

The __forceinline__ function qualifier can be used to force the compiler to inline the function.

The __noinline__ and __forceinline__ function qualifiers cannot be used together, and neither function qualifier can be applied to an inline function.

7.1.6. __inline_hint__

The __inline_hint__ qualifier enables more aggressive inlining in the compiler. Unlike __forceinline__, it does not imply that the function is inline. It can be used to improve inlining across modules when using LTO.

Neither the __noinline__ nor the __forceinline__ function qualifier can be used with the __inline_hint__ function qualifier.
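
As a brief illustration, the three qualifiers might appear on hypothetical device functions as follows (a sketch; the function names are not from the guide):

```cuda
// Hint the compiler not to inline this helper.
__noinline__ __device__ float expensive_helper(float x) { return x * x + 1.0f; }

// Force the compiler to inline this small helper.
__forceinline__ __device__ float scale(float x) { return 2.0f * x; }

// Encourage (but do not force) inlining, e.g., across modules when using LTO.
__inline_hint__ __device__ float bias(float x) { return x + 0.5f; }
```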

7.2. Variable Memory Space Specifiers

Variable memory space specifiers denote the memory location on the device of a variable.

An automatic variable declared in device code without any of the __device__, __shared__ and __constant__ memory space specifiers described in this section generally resides in a register. However in some cases the compiler might choose to place it in local memory, which can have adverse performance consequences as detailed in Device Memory Accesses.

7.2.1. __device__

The __device__ memory space specifier declares a variable that resides on the device.

At most one of the other memory space specifiers defined in the next three sections may be used together with __device__ to further denote which memory space the variable belongs to. If none of them is present, the variable:

  • Resides in global memory space,

  • Has the lifetime of the CUDA context in which it is created,

  • Has a distinct object per device,

  • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).
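
The following sketch shows a __device__ variable being written from the host through the runtime symbol API and used inside a kernel; the variable and kernel names are hypothetical:

```cuda
#include <cuda_runtime.h>

__device__ float devData;   // resides in global memory, one object per device

__global__ void useDevData()
{
    devData *= 2.0f;        // accessible from all threads in the grid
}

int main()
{
    float value = 3.14f;
    // Host access goes through the runtime symbol API, not a plain pointer.
    cudaMemcpyToSymbol(devData, &value, sizeof(float));
    useDevData<<<1, 1>>>();
    cudaMemcpyFromSymbol(&value, devData, sizeof(float));
    return 0;
}
```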

7.2.2. __constant__

The __constant__ memory space specifier, optionally used together with __device__, declares a variable that:

  • Resides in constant memory space,

  • Has the lifetime of the CUDA context in which it is created,

  • Has a distinct object per device,

  • Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol()).

The behavior of modifying a constant from the host while a concurrent grid accesses that constant at any point of the grid’s lifetime is undefined.
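
A minimal sketch of a __constant__ variable, initialized from the host before any grid that reads it is launched (the kernel and symbol names are hypothetical):

```cuda
#include <cuda_runtime.h>

__constant__ float coeffs[4];   // resides in constant memory space

__global__ void apply(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = coeffs[0] + coeffs[1] * in[i];  // read-only access from the grid
}

// Host side: write the constant before launching any grid that reads it,
// to avoid the undefined concurrent-modification behavior described above.
void setup()
{
    float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
}
```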

7.2.3. __shared__

The __shared__ memory space specifier, optionally used together with __device__, declares a variable that:

  • Resides in the shared memory space of a thread block,

  • Has the lifetime of the block,

  • Has a distinct object per block,

  • Is only accessible from all the threads within the block,

  • Does not have a constant address.

When declaring a variable in shared memory as an external array such as

extern __shared__ float shared[];

the size of the array is determined at launch time (see Execution Configuration). All variables declared in this fashion start at the same address in memory, so the layout of the variables in the array must be explicitly managed through offsets. For example, if one wants the equivalent of

short array0[128];
float array1[64];
int   array2[256];

in dynamically allocated shared memory, one could declare and initialize the arrays the following way:

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int*   array2 =   (int*)&array1[64];
}

Note that pointers need to be aligned to the type they point to, so the following code, for example, does not work, since array1 is not aligned to 4 bytes.

extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[127];
}

Alignment requirements for the built-in vector types are listed in Table 5.
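
One way to satisfy these alignment requirements in the earlier three-array example is to order the arrays from the most strictly aligned type to the least, so that every computed offset is naturally aligned. This is a sketch, not a layout prescribed by the guide:

```cuda
extern __shared__ float array[];
__device__ void func()      // __device__ or __global__ function
{
    // Largest alignment first: int (4 bytes), then float (4), then short (2).
    int*   array2 =   (int*)array;           // 256 ints  -> 1024 bytes
    float* array1 = (float*)&array2[256];    // 64 floats -> offset 1024, aligned to 4
    short* array0 = (short*)&array1[64];     // 128 shorts -> offset 1280, aligned to 2
    // Launch with a dynamic shared memory size of at least
    // 256*sizeof(int) + 64*sizeof(float) + 128*sizeof(short) bytes.
}
```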

7.2.4. __grid_constant__

The __grid_constant__ annotation for compute architectures greater than or equal to 7.0 annotates a const-qualified __global__ function parameter of non-reference type that:

  • Has the lifetime of the grid,

  • Is private to the grid, i.e., the object is not accessible to host threads and threads from other grids, including sub-grids,

  • Has a distinct object per grid, i.e., all threads in the grid see the same address,

  • Is read-only, i.e., modifying a __grid_constant__ object or any of its sub-objects is undefined behavior, including mutable members.

Requirements:

  • Kernel parameters annotated with __grid_constant__ must have const-qualified non-reference types.

  • All function declarations must match with respect to any __grid_constant__ parameters.

  • A function template specialization must match the primary template declaration with respect to any __grid_constant__ parameters.

  • A function template instantiation directive must match the primary template declaration with respect to any __grid_constant__ parameters.

If the address of a __global__ function parameter is taken, the compiler will ordinarily make a copy of the kernel parameter in thread local memory and use the address of the copy, to partially support C++ semantics, which allow each thread to modify its own local copy of function parameters. Annotating a __global__ function parameter with __grid_constant__ ensures that the compiler will not create a copy of the kernel parameter in thread local memory, but will instead use the generic address of the parameter itself. Avoiding the local copy may result in improved performance.

__device__ void unknown_function(S const&);
__global__ void kernel(const __grid_constant__ S s) {
   s.x += threadIdx.x;  // Undefined Behavior: tried to modify read-only memory

   // Compiler will _not_ create a per-thread thread local copy of "s":
   unknown_function(s);
}

7.2.5. __managed__

The __managed__ memory space specifier, optionally used together with __device__, declares a variable that:

  • Can be referenced from both device and host code, for example, its address can be taken or it can be read or written directly from a device or host function.

  • Has the lifetime of an application.

See __managed__ Memory Space Specifier for more details.
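
As a sketch, assuming a system with managed memory support, a __managed__ variable can be written by a kernel and then read directly from host code:

```cuda
#include <cuda_runtime.h>

__managed__ int counter = 0;   // referenceable from both host and device

__global__ void increment()
{
    atomicAdd(&counter, 1);
}

int main()
{
    increment<<<1, 64>>>();
    cudaDeviceSynchronize();   // make device writes visible to the host
    // counter can now be read directly from host code, no cudaMemcpy needed.
    return (counter == 64) ? 0 : 1;
}
```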

7.2.6. __restrict__

nvcc supports restricted pointers via the __restrict__ keyword.

Restricted pointers were introduced in C99 to alleviate the aliasing problem that exists in C-type languages, and which inhibits all kinds of optimization, from code re-ordering to common sub-expression elimination.

Here is an example subject to the aliasing issue, where use of restricted pointers can help the compiler to reduce the number of instructions:

void foo(const float* a,
         const float* b,
         float* c)
{
    c[0] = a[0] * b[0];
    c[1] = a[0] * b[0];
    c[2] = a[0] * b[0] * a[1];
    c[3] = a[0] * a[1];
    c[4] = a[0] * b[0];
    c[5] = b[0];
    ...
}

In C-type languages, the pointers a, b, and c may be aliased, so any write through c could modify elements of a or b. This means that to guarantee functional correctness, the compiler cannot load a[0] and b[0] into registers, multiply them, and store the result to both c[0] and c[1], because the results would differ from the abstract execution model if, say, a[0] is really the same location as c[0]. So the compiler cannot take advantage of the common sub-expression. Likewise, the compiler cannot just reorder the computation of c[4] into the proximity of the computation of c[0] and c[1] because the preceding write to c[3] could change the inputs to the computation of c[4].

By making a, b, and c restricted pointers, the programmer asserts to the compiler that the pointers are in fact not aliased, which in this case means writes through c would never overwrite elements of a or b. This changes the function prototype as follows:

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c);

Note that all pointer arguments need to be made restricted for the compiler optimizer to derive any benefit. With the __restrict__ keywords added, the compiler can now reorder and do common sub-expression elimination at will, while retaining functionality identical with the abstract execution model:

void foo(const float* __restrict__ a,
         const float* __restrict__ b,
         float* __restrict__ c)
{
    float t0 = a[0];
    float t1 = b[0];
    float t2 = t0 * t1;
    float t3 = a[1];
    c[0] = t2;
    c[1] = t2;
    c[4] = t2;
    c[2] = t2 * t3;
    c[3] = t0 * t3;
    c[5] = t1;
    ...
}

The effects here are a reduced number of memory accesses and a reduced number of computations. This is balanced by an increase in register pressure due to “cached” loads and common sub-expressions.

Since register pressure is a critical issue in many CUDA codes, use of restricted pointers can have a negative performance impact on CUDA code, due to reduced occupancy.

7.3. Built-in Vector Types

7.3.1. char, short, int, long, longlong, float, double

These are vector types derived from the basic integer and floating-point types. They are structures, and the 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example,

int2 make_int2(int x, int y);

which creates a vector of type int2 with value (x, y).

The alignment requirements of the vector types are detailed in the following table.

Table 5 Alignment Requirements

Type                    Alignment
char1, uchar1           1
char2, uchar2           2
char3, uchar3           1
char4, uchar4           4
short1, ushort1         2
short2, ushort2         4
short3, ushort3         2
short4, ushort4         8
int1, uint1             4
int2, uint2             8
int3, uint3             4
int4, uint4             16
long1, ulong1           4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long2, ulong2           8 if sizeof(long) is equal to sizeof(int), 16 otherwise
long3, ulong3           4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long4, ulong4           16
longlong1, ulonglong1   8
longlong2, ulonglong2   16
longlong3, ulonglong3   8
longlong4, ulonglong4   16
float1                  4
float2                  8
float3                  4
float4                  16
double1                 8
double2                 16
double3                 8
double4                 16

7.3.2. dim3

This type is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
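
For example, unspecified dim3 components default to 1, which is convenient for 1D and 2D launches (a sketch; the kernel name is hypothetical):

```cuda
// Unspecified components default to 1.
dim3 block(16, 16);   // 16 x 16 x 1 threads per block
dim3 grid(64);        // 64 x 1 x 1 blocks

// A launch using these dimensions:
// myKernel<<<grid, block>>>(...);
```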

7.4. Built-in Variables

Built-in variables specify the grid and block dimensions and the block and thread indices. They are only valid within functions that are executed on the device.

7.4.1. gridDim

This variable is of type dim3 (see dim3) and contains the dimensions of the grid.

7.4.2. blockIdx

This variable is of type uint3 (see char, short, int, long, longlong, float, double) and contains the block index within the grid.

7.4.3. blockDim

This variable is of type dim3 (see dim3) and contains the dimensions of the block.

7.4.4. threadIdx

This variable is of type uint3 (see char, short, int, long, longlong, float, double) and contains the thread index within the block.

7.4.5. warpSize

This variable is of type int and contains the warp size in threads (see SIMT Architecture for the definition of a warp).
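
These built-in variables are commonly combined to compute a unique global thread index, as in this sketch (the kernel name is hypothetical):

```cuda
__global__ void writeIndex(int* out)
{
    // Global index of this thread within a 1D grid of 1D blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;
}
// For multi-dimensional grids, the same pattern extends with the
// .y and .z components of gridDim, blockIdx, blockDim, and threadIdx.
```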

7.5. Memory Fence Functions

The CUDA programming model assumes a device with a weakly-ordered memory model; that is, the order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread. It is undefined behavior for two threads to read from or write to the same memory location without synchronization.

In the following example, thread 1 executes writeXY(), while thread 2 executes readXY().

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    int A = X;
}

The two threads read and write from the same memory locations X and Y simultaneously. Any data race is undefined behavior and has no defined semantics. The resulting values for A and B can be anything.

Memory fence functions can be used to enforce a sequentially-consistent ordering on memory accesses. The memory fence functions differ in the scope in which the orderings are enforced, but they are independent of the accessed memory space (shared memory, global memory, page-locked host memory, and the memory of a peer device).

void __threadfence_block();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_block) and ensures that:

  • All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to all memory made by the calling thread after the call to __threadfence_block();

  • All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory made by the calling thread after the call to __threadfence_block().

void __threadfence();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_device) and ensures that no writes to all memory made by the calling thread after the call to __threadfence() are observed by any thread in the device as occurring before any write to all memory made by the calling thread before the call to __threadfence().

void __threadfence_system();

is equivalent to cuda::atomic_thread_fence(cuda::memory_order_seq_cst, cuda::thread_scope_system) and ensures that all writes to all memory made by the calling thread before the call to __threadfence_system() are observed by all threads in the device, host threads, and all threads in peer devices as occurring before all writes to all memory made by the calling thread after the call to __threadfence_system().

__threadfence_system() is only supported by devices of compute capability 2.x and higher.

In the previous code sample, we can insert fences in the code as follows:

__device__ int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

__device__ void readXY()
{
    int B = Y;
    __threadfence();
    int A = X;
}

For this code, the following outcomes can be observed:

  • A equal to 1 and B equal to 2,

  • A equal to 10 and B equal to 2,

  • A equal to 10 and B equal to 20.

The fourth outcome (A equal to 1 and B equal to 20) is not possible, because the first write must be visible before the second write. If threads 1 and 2 belong to the same block, it is enough to use __threadfence_block(). If threads 1 and 2 do not belong to the same block, __threadfence() must be used if they are CUDA threads from the same device, and __threadfence_system() must be used if they are CUDA threads from two different devices.

A common use case is when threads consume some data produced by other threads, as illustrated by the following code sample of a kernel that computes the sum of an array of N numbers in one call. Each block first sums a subset of the array and stores the result in global memory. When all blocks are done, the last block done reads each of these partial sums from global memory and sums them to obtain the final result. In order to determine which block is finished last, each block atomically increments a counter to signal that it is done with computing and storing its partial sum (see Atomic Functions). The last block is the one that receives the counter value equal to gridDim.x-1. If no fence is placed between storing the partial sum and incrementing the counter, the counter might increment before the partial sum is stored and might therefore reach gridDim.x-1 and let the last block start reading partial sums before they have actually been updated in memory.

Memory fence functions only affect the ordering of memory operations by a thread; they do not, by themselves, ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block; see Synchronization Functions). In the code sample below, the visibility of memory operations on the result variable is ensured by declaring it as volatile (see Volatile Qualifier).

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementing
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {

        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {

            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}

7.6. Synchronization Functions

void __syncthreads();

waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.

__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.

__syncthreads() is allowed in conditional code, but only if the conditional evaluates identically across the entire thread block; otherwise the code execution is likely to hang or produce unintended side effects.

Devices of compute capability 2.x and higher support three variations of __syncthreads() described below.

int __syncthreads_count(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

int __syncthreads_and(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

void __syncwarp(unsigned mask=0xffffffff);

will cause the executing thread to wait until all warp lanes named in mask have executed a __syncwarp() (with the same mask) before resuming execution. Each calling thread must have its own bit set in the mask, and all non-exited threads named in mask must execute a corresponding __syncwarp() with the same mask, or the result is undefined.

Executing __syncwarp() guarantees memory ordering among threads participating in the barrier. Thus, threads within a warp that wish to communicate via memory can store to memory, execute __syncwarp(), and then safely read values stored by other threads in the warp.

Note

For .target sm_6x or below, all threads in mask must execute the same __syncwarp() in convergence, and the union of all values in mask must be equal to the active mask. Otherwise, the behavior is undefined.
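
The store/fence/load pattern described above can be sketched as follows; this example assumes a launch of exactly one warp (32 threads) per block, and the kernel name is hypothetical:

```cuda
// Sketch: threads of one warp exchange values through shared memory.
// Assumes blockDim.x == 32, i.e., a single warp per block.
__global__ void warpShare(int* out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x % 32;

    buf[lane] = lane;                      // each lane stores its own value
    __syncwarp();                          // order the stores before the loads below
    int neighbor = buf[(lane + 1) % 32];   // safely read another lane's value
    out[threadIdx.x] = neighbor;
}
```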

7.7. Mathematical Functions

The reference manual lists all C/C++ standard library mathematical functions that are supported in device code and all intrinsic functions that are only supported in device code.

Mathematical Functions provides accuracy information for some of these functions when relevant.

7.8. Texture Functions

Texture objects are described in Texture Object API.

Texture fetching is described in Texture Fetching.

7.8.1. Texture Object API

7.8.1.1. tex1Dfetch()

template<class T>
T tex1Dfetch(cudaTextureObject_t texObj, int x);

fetches from the region of linear memory specified by the one-dimensional texture object texObj using integer texture coordinate x. tex1Dfetch() only works with non-normalized coordinates, so only the border and clamp addressing modes are supported. It does not perform any texture filtering. For integer types, it may optionally promote the integer to single-precision floating point.
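
A hedged sketch of a kernel that reads linear memory through a 1D texture object; the setup of texObj with cudaCreateTextureObject() is elided, and the kernel name is hypothetical:

```cuda
// Sketch: read from linear memory through a 1D texture object.
// Creating texObj from a linear buffer via cudaCreateTextureObject() is elided.
__global__ void copyThroughTexture(cudaTextureObject_t texObj, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(texObj, i);  // integer coordinate, no filtering
}
```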

7.8.1.2. tex1D()

template<class T>
T tex1D(cudaTextureObject_t texObj, float x);

fetches from the CUDA array specified by the one-dimensional texture object texObj using texture coordinate x.

7.8.1.3. tex1DLod()

template<class T>
T tex1DLod(cudaTextureObject_t texObj, float x, float level);

fetches from the CUDA array specified by the one-dimensional texture object texObj using texture coordinate x at the level-of-detail level.

7.8.1.4. tex1DGrad()

template<class T>
T tex1DGrad(cudaTextureObject_t texObj, float x, float dx, float dy);

fetches from the CUDA array specified by the one-dimensional texture object texObj using texture coordinate x. The level-of-detail is derived from the X-gradient dx and Y-gradient dy.

7.8.1.5. tex2D()

template<class T>
T tex2D(cudaTextureObject_t texObj, float x, float y);

fetches from the CUDA array or the region of linear memory specified by the two-dimensional texture object texObj using texture coordinate (x,y).

7.8.1.6. tex2D() for sparse CUDA arrays

template<class T>
T tex2D(cudaTextureObject_t texObj, float x, float y, bool* isResident);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y). Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.7. tex2Dgather()

template<class T>
T tex2Dgather(cudaTextureObject_t texObj,
              float x, float y, int comp = 0);

fetches from the CUDA array specified by the 2D texture object texObj using texture coordinates x and y and the comp parameter as described in Texture Gather.

7.8.1.8. tex2Dgather() for sparse CUDA arrays

template<class T>
T tex2Dgather(cudaTextureObject_t texObj,
              float x, float y, bool* isResident, int comp = 0);

fetches from the CUDA array specified by the 2D texture object texObj using texture coordinates x and y and the comp parameter as described in Texture Gather. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.9. tex2DGrad()

template<class T>
T tex2DGrad(cudaTextureObject_t texObj, float x, float y,
            float2 dx, float2 dy);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y). The level-of-detail is derived from the dx and dy gradients.

7.8.1.10. tex2DGrad() for sparse CUDA arrays

template<class T>
T tex2DGrad(cudaTextureObject_t texObj, float x, float y,
            float2 dx, float2 dy, bool* isResident);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y). The level-of-detail is derived from the dx and dy gradients. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.11. tex2DLod()

template<class T>
T tex2DLod(cudaTextureObject_t texObj, float x, float y, float level);

fetches from the CUDA array or the region of linear memory specified by the two-dimensional texture object texObj using texture coordinate (x,y) at level-of-detail level.

7.8.1.12. tex2DLod() for sparse CUDA arrays

template<class T>
T tex2DLod(cudaTextureObject_t texObj, float x, float y, float level, bool* isResident);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y) at level-of-detail level. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.13. tex3D()

template<class T>
T tex3D(cudaTextureObject_t texObj, float x, float y, float z);

fetches from the CUDA array specified by the three-dimensional texture object texObj using texture coordinate (x,y,z).

7.8.1.14. tex3D() for sparse CUDA arrays

template<class T>
T tex3D(cudaTextureObject_t texObj, float x, float y, float z, bool* isResident);

fetches from the CUDA array specified by the three-dimensional texture object texObj using texture coordinate (x,y,z). Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.15. tex3DLod()

template<class T>
T tex3DLod(cudaTextureObject_t texObj, float x, float y, float z, float level);

fetches from the CUDA array or the region of linear memory specified by the three-dimensional texture object texObj using texture coordinate (x,y,z) at level-of-detail level.

7.8.1.16. tex3DLod() for sparse CUDA arrays

template<class T>
T tex3DLod(cudaTextureObject_t texObj, float x, float y, float z, float level, bool* isResident);

fetches from the CUDA array or the region of linear memory specified by the three-dimensional texture object texObj using texture coordinate (x,y,z) at level-of-detail level. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.17. tex3DGrad()

template<class T>
T tex3DGrad(cudaTextureObject_t texObj, float x, float y, float z,
            float4 dx, float4 dy);

fetches from the CUDA array specified by the three-dimensional texture object texObj using texture coordinate (x,y,z) at a level-of-detail derived from the X and Y gradients dx and dy.

7.8.1.18. tex3DGrad() for sparse CUDA arrays

template<class T>
T tex3DGrad(cudaTextureObject_t texObj, float x, float y, float z,
            float4 dx, float4 dy, bool* isResident);

fetches from the CUDA array specified by the three-dimensional texture object texObj using texture coordinate (x,y,z) at a level-of-detail derived from the X and Y gradients dx and dy. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.19. tex1DLayered()

template<class T>
T tex1DLayered(cudaTextureObject_t texObj, float x, int layer);

fetches from the CUDA array specified by the one-dimensional texture object texObj using texture coordinate x and index layer, as described in Layered Textures.

7.8.1.20. tex1DLayeredLod()

template<class T>
T tex1DLayeredLod(cudaTextureObject_t texObj, float x, int layer, float level);

fetches from the CUDA array specified by the one-dimensional layered texture at layer layer using texture coordinate x and level-of-detail level.

7.8.1.21. tex1DLayeredGrad()

template<class T>
T tex1DLayeredGrad(cudaTextureObject_t texObj, float x, int layer,
                   float dx, float dy);

fetches from the CUDA array specified by the one-dimensional layered texture at layer layer using texture coordinate x and a level-of-detail derived from the dx and dy gradients.

7.8.1.22. tex2DLayered()

template<class T>
T tex2DLayered(cudaTextureObject_t texObj,
               float x, float y, int layer);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y) and index layer, as described in Layered Textures.

7.8.1.23. tex2DLayered() for sparse CUDA arrays

template<class T>
T tex2DLayered(cudaTextureObject_t texObj,
               float x, float y, int layer, bool* isResident);

fetches from the CUDA array specified by the two-dimensional texture object texObj using texture coordinate (x,y) and index layer, as described in Layered Textures. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.24. tex2DLayeredLod()

template<class T>
T tex2DLayeredLod(cudaTextureObject_t texObj, float x, float y, int layer,
                  float level);

fetches from the CUDA array specified by the two-dimensional layered texture at layer layer using texture coordinate (x,y) at level-of-detail level.

7.8.1.25. tex2DLayeredLod() for sparse CUDA arrays

template<class T>
T tex2DLayeredLod(cudaTextureObject_t texObj, float x, float y, int layer,
                  float level, bool* isResident);

fetches from the CUDA array specified by the two-dimensional layered texture at layer layer using texture coordinate (x,y) at level-of-detail level. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.26. tex2DLayeredGrad()

template<class T>
T tex2DLayeredGrad(cudaTextureObject_t texObj, float x, float y, int layer,
                   float2 dx, float2 dy);

fetches from the CUDA array specified by the two-dimensional layered texture at layer layer using texture coordinate (x,y) and a level-of-detail derived from the dx and dy gradients.

7.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays

template<class T>
T tex2DLayeredGrad(cudaTextureObject_t texObj, float x, float y, int layer,
                   float2 dx, float2 dy, bool* isResident);

fetches from the CUDA array specified by the two-dimensional layered texture at layer layer using texture coordinate (x,y) and a level-of-detail derived from the dx and dy gradients. Also returns whether the texel is resident in memory via the isResident pointer. If not, the values fetched will be zeros.

7.8.1.28. texCubemap()

template<class T>
T texCubemap(cudaTextureObject_t texObj, float x, float y, float z);

fetches from the CUDA array specified by the cubemap texture object texObj using texture coordinate (x,y,z), as described in Cubemap Textures.

7.8.1.29. texCubemapGrad()

template<class T>
T texCubemapGrad(cudaTextureObject_t texObj, float x, float y, float z,
                float4 dx, float4 dy);

fetches from the CUDA array specified by the cubemap texture object texObj using texture coordinate (x,y,z) as described in Cubemap Textures. The level-of-detail used is derived from the dx and dy gradients.

7.8.1.30. texCubemapLod()

template<class T>
T texCubemapLod(cudaTextureObject_t texObj, float x, float y, float z,
                float level);

fetches from the CUDA array specified by the cubemap texture object texObj using texture coordinate (x,y,z) as described in Cubemap Textures. The level-of-detail used is given by level.

7.8.1.31. texCubemapLayered()

template<class T>
T texCubemapLayered(cudaTextureObject_t texObj,
                    float x, float y, float z, int layer);

fetches from the CUDA array specified by the cubemap layered texture object texObj using texture coordinates (x,y,z), and index layer, as described in Cubemap Layered Textures.

7.8.1.32. texCubemapLayeredGrad()

template<class T>
T texCubemapLayeredGrad(cudaTextureObject_t texObj, float x, float y, float z,
                       int layer, float4 dx, float4 dy);

fetches from the CUDA array specified by the cubemap layered texture object texObj using texture coordinate (x,y,z) and index layer, as described in Cubemap Layered Textures, at level-of-detail derived from the dx and dy gradients.

7.8.1.33. texCubemapLayeredLod()

template<class T>
T texCubemapLayeredLod(cudaTextureObject_t texObj, float x, float y, float z,
                       int layer, float level);

fetches from the CUDA array specified by the cubemap layered texture object texObj using texture coordinate (x,y,z) and index layer, as described in Cubemap Layered Textures, at level-of-detail level.

7.9. Surface Functions

Surface functions are only supported by devices of compute capability 2.0 and higher.

Surface objects are described in Surface Object API.

In the sections below, boundaryMode specifies the boundary mode, that is, how out-of-range surface coordinates are handled. It is equal to one of the following:

  • cudaBoundaryModeClamp, in which case out-of-range coordinates are clamped to the valid range,

  • cudaBoundaryModeZero, in which case out-of-range reads return zero and out-of-range writes are ignored, or

  • cudaBoundaryModeTrap, in which case out-of-range accesses cause the kernel execution to fail.

7.9.1. Surface Object API

7.9.1.1. surf1Dread()

template<class T>
T surf1Dread(cudaSurfaceObject_t surfObj, int x,
               boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the one-dimensional surface object surfObj using byte coordinate x.

7.9.1.2. surf1Dwrite()

template<class T>
void surf1Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x,
                  boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the one-dimensional surface object surfObj at byte coordinate x.
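Since surface coordinates are byte offsets rather than element indices, element i of a float surface lives at byte coordinate i * sizeof(float). A minimal sketch (kernel name is illustrative) combining surf1Dread() and surf1Dwrite():

```cuda
// Illustrative kernel: scales each element of a 1D float surface in place.
// Surface x coordinates are in bytes, hence the multiplication by sizeof(float).
__global__ void scaleSurface(cudaSurfaceObject_t surfObj, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = surf1Dread<float>(surfObj, i * sizeof(float));
        surf1Dwrite(v * factor, surfObj, i * sizeof(float));
    }
}
```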

7.9.1.3. surf2Dread()

template<class T>
T surf2Dread(cudaSurfaceObject_t surfObj,
              int x, int y,
              boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf2Dread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y,
                 boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the two-dimensional surface object surfObj using byte coordinates x and y.

7.9.1.4. surf2Dwrite()

template<class T>
void surf2Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x, int y,
                  boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the two-dimensional surface object surfObj at byte coordinates x and y.
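A common pattern is copying between two 2D surfaces; the sketch below (names are illustrative) shows that only the x coordinate is scaled to bytes, while y remains a row index:

```cuda
// Illustrative kernel: copies a width x height region of floats between
// two 2D surface objects. Only x is a byte coordinate; y counts rows.
__global__ void copySurface(cudaSurfaceObject_t src, cudaSurfaceObject_t dst,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float v = surf2Dread<float>(src, x * sizeof(float), y);
        surf2Dwrite(v, dst, x * sizeof(float), y);
    }
}
```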

7.9.1.5. surf3Dread()

template<class T>
T surf3Dread(cudaSurfaceObject_t surfObj,
              int x, int y, int z,
              boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf3Dread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int z,
                 boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the three-dimensional surface object surfObj using byte coordinates x, y, and z.

7.9.1.6. surf3Dwrite()

template<class T>
void surf3Dwrite(T data,
                  cudaSurfaceObject_t surfObj,
                  int x, int y, int z,
                  boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the three-dimensional surface object surfObj at byte coordinates x, y, and z.

7.9.1.7. surf1DLayeredread()

template<class T>
T surf1DLayeredread(
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf1DLayeredread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the one-dimensional layered surface object surfObj using byte coordinate x and index layer.

7.9.1.8. surf1DLayeredwrite()

template<class T>
void surf1DLayeredwrite(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int layer,
                 boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the one-dimensional layered surface object surfObj at byte coordinate x and index layer.

7.9.1.9. surf2DLayeredread()

template<class T>
T surf2DLayeredread(
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int layer,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surf2DLayeredread(T* data,
                         cudaSurfaceObject_t surfObj,
                         int x, int y, int layer,
                         boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the two-dimensional layered surface object surfObj using byte coordinate x and y, and index layer.

7.9.1.10. surf2DLayeredwrite()

template<class T>
void surf2DLayeredwrite(T data,
                          cudaSurfaceObject_t surfObj,
                          int x, int y, int layer,
                          boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the two-dimensional layered surface object surfObj at byte coordinates x and y, and index layer.

7.9.1.11. surfCubemapread()

template<class T>
T surfCubemapread(
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surfCubemapread(T* data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the cubemap surface object surfObj using byte coordinate x and y, and face index face.

7.9.1.12. surfCubemapwrite()

template<class T>
void surfCubemapwrite(T data,
                 cudaSurfaceObject_t surfObj,
                 int x, int y, int face,
                 boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the cubemap object surfObj at byte coordinate x and y, and face index face.

7.9.1.13. surfCubemapLayeredread()

template<class T>
T surfCubemapLayeredread(
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);
template<class T>
void surfCubemapLayeredread(T* data,
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);

reads the CUDA array specified by the cubemap layered surface object surfObj using byte coordinate x and y, and index layerFace.

7.9.1.14. surfCubemapLayeredwrite()

template<class T>
void surfCubemapLayeredwrite(T data,
             cudaSurfaceObject_t surfObj,
             int x, int y, int layerFace,
             boundaryMode = cudaBoundaryModeTrap);

writes value data to the CUDA array specified by the cubemap layered object surfObj at byte coordinate x and y, and index layerFace.

7.10. Read-Only Data Cache Load Function

The read-only data cache load function is only supported by devices of compute capability 5.0 and higher.

T __ldg(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation is cached in the read-only data cache (see Global Memory).
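A sketch of typical use (kernel name is illustrative): routing a read-only input through __ldg() while writing the output normally:

```cuda
// Illustrative kernel: the x array is read-only for the duration of the
// kernel, so its loads can go through the read-only data cache via __ldg().
__global__ void saxpy(float a, const float* __restrict__ x, float* y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * __ldg(&x[i]) + y[i];
}
```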

7.11. Load Functions Using Cache Hints

These load functions are only supported by devices of compute capability 5.0 and higher.

T __ldcg(const T* address);
T __ldca(const T* address);
T __ldcs(const T* address);
T __ldlu(const T* address);
T __ldcv(const T* address);

returns the data of type T located at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation uses the corresponding cache operator (see the PTX ISA).

7.12. Store Functions Using Cache Hints

These store functions are only supported by devices of compute capability 5.0 and higher.

void __stwb(T* address, T value);
void __stcg(T* address, T value);
void __stcs(T* address, T value);
void __stwt(T* address, T value);

stores the value argument of type T to the location at address address, where T is char, signed char, short, int, long, long long, unsigned char, unsigned short, unsigned int, unsigned long, unsigned long long, char2, char4, short2, short4, int2, int4, longlong2, uchar2, uchar4, ushort2, ushort4, uint2, uint4, ulonglong2, float, float2, float4, double, or double2. With the cuda_fp16.h header included, T can be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162. The operation uses the corresponding cache operator (see the PTX ISA).
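The load and store hints are typically paired. In the sketch below (kernel name is illustrative), data touched exactly once is marked as streaming with __ldcs()/__stcs() so it is evicted first rather than displacing reusable cache lines:

```cuda
// Illustrative kernel: streaming copy. Each element is read and written
// exactly once, so the evict-first cache hints avoid cache pollution.
__global__ void streamCopy(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        __stcs(&out[i], __ldcs(&in[i]));
}
```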

7.13. Time Function

clock_t clock();
long long int clock64();

when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle. Sampling this counter at the beginning and at the end of a kernel, taking the difference of the two samples, and recording the result per thread provides a measure for each thread of the number of clock cycles taken by the device to completely execute the thread, but not of the number of clock cycles the device actually spent executing thread instructions. The former number is greater than the latter since threads are time sliced.
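A sketch of the sampling pattern just described (the kernel name and the timed work are illustrative):

```cuda
// Illustrative kernel: records, per thread, the elapsed cycles between two
// clock64() samples. The difference includes time the thread spent switched
// out, not only cycles spent executing its own instructions.
__global__ void timedKernel(float* data, long long* cycles, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        long long start = clock64();
        data[i] = sqrtf(data[i]) * 2.0f;   // the work being timed
        cycles[i] = clock64() - start;
    }
}
```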

7.14. Atomic Functions

An atomic function performs a read-modify-write atomic operation on one 32-bit, 64-bit, or 128-bit word residing in global or shared memory. In the case of float2 or float4, the read-modify-write operation is performed on each element of the vector residing in global memory. For example, atomicAdd() reads a word at some address in global or shared memory, adds a number to it, and writes the result back to the same address. Atomic functions can only be used in device functions.

The atomic functions described in this section have ordering cuda::memory_order_relaxed and are only atomic at a particular scope:

  • Atomic APIs with _system suffix (example: atomicAdd_system) are atomic at scope cuda::thread_scope_system if they meet particular conditions.

  • Atomic APIs without a suffix (example: atomicAdd) are atomic at scope cuda::thread_scope_device.

  • Atomic APIs with _block suffix (example: atomicAdd_block) are atomic at scope cuda::thread_scope_block.

In the following example both the CPU and the GPU atomically update an integer value at address addr:

__global__ void mykernel(int *addr) {
  atomicAdd_system(addr, 10);       // only available on devices with compute capability 6.x
}

void foo() {
  int *addr;
  cudaMallocManaged(&addr, 4);
  *addr = 0;

  mykernel<<<...>>>(addr);
  __sync_fetch_and_add(addr, 10);  // CPU atomic operation
}

Note that any atomic operation can be implemented based on atomicCAS() (Compare And Swap). For example, atomicAdd() for double-precision floating-point numbers is not available on devices with compute capability lower than 6.0 but it can be implemented as follows:

#if __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull =
                              (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;

    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val +
                               __longlong_as_double(assumed)));

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);

    return __longlong_as_double(old);
}
#endif
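The same retry pattern can be exercised in standard C++ on the host with std::atomic (the helper names below are illustrative, not CUDA APIs); the double's bit pattern is compared as a 64-bit integer, exactly as the device code above does:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Host-side emulation of the CAS loop (illustrative, not a CUDA API).
static double bitsToDouble(uint64_t b) { double d; std::memcpy(&d, &b, sizeof d); return d; }
static uint64_t doubleToBits(double d) { uint64_t b; std::memcpy(&b, &d, sizeof b); return b; }

double casAddDouble(std::atomic<uint64_t>* addr, double val) {
    uint64_t old = addr->load(), assumed;
    do {
        assumed = old;
        uint64_t desired = doubleToBits(val + bitsToDouble(assumed));
        old = assumed;
        // On failure, compare_exchange_strong loads the observed value into
        // `old`, mirroring the return value of atomicCAS().
        addr->compare_exchange_strong(old, desired);
    } while (assumed != old);
    return bitsToDouble(old);
}
```

Because the comparison is on integer bit patterns, the loop terminates even if the stored value is NaN.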

There are system-wide and block-wide variants of the following device-wide atomic APIs, with the following exceptions:

  • Devices with compute capability less than 6.0 only support device-wide atomic operations,

  • Tegra devices with compute capability less than 7.2 do not support system-wide atomic operations.

7.14.1. Arithmetic Functions

7.14.1.1. atomicAdd()

int atomicAdd(int* address, int val);
unsigned int atomicAdd(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicAdd(unsigned long long int* address,
                                 unsigned long long int val);
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
__half2 atomicAdd(__half2 *address, __half2 val);
__half atomicAdd(__half *address, __half val);
__nv_bfloat162 atomicAdd(__nv_bfloat162 *address, __nv_bfloat162 val);
__nv_bfloat16 atomicAdd(__nv_bfloat16 *address, __nv_bfloat16 val);
float2 atomicAdd(float2* address, float2 val);
float4 atomicAdd(float4* address, float4 val);

reads the 16-bit, 32-bit, or 64-bit word old located at the address address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

The 32-bit floating-point version of atomicAdd() is only supported by devices of compute capability 2.x and higher.

The 64-bit floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher.

The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher. The atomicity of the __half2 or __nv_bfloat162 add operation is guaranteed separately for each of the two __half or __nv_bfloat16 elements; the entire __half2 or __nv_bfloat162 is not guaranteed to be atomic as a single 32-bit access.

The float2 and float4 floating-point vector versions of atomicAdd() are only supported by devices of compute capability 9.x and higher. The atomicity of the float2 or float4 add operation is guaranteed separately for each of the two or four float elements; the entire float2 or float4 is not guaranteed to be atomic as a single 64-bit or 128-bit access.

The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.

The 16-bit __nv_bfloat16 floating-point version of atomicAdd() is only supported by devices of compute capability 8.x and higher.

The float2 and float4 floating-point vector versions of atomicAdd() are only supported for global memory addresses.
atomicAdd()float2float4 浮点向量版本仅支持全局内存地址。

7.14.1.2. atomicSub()

int atomicSub(int* address, int val);
unsigned int atomicSub(unsigned int* address,
                       unsigned int val);

reads the 32-bit word old located at the address address in global or shared memory, computes (old - val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.

7.14.1.3. atomicExch()

int atomicExch(int* address, int val);
unsigned int atomicExch(unsigned int* address,
                        unsigned int val);
unsigned long long int atomicExch(unsigned long long int* address,
                                  unsigned long long int val);
float atomicExch(float* address, float val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory and stores val back to memory at the same address. These two operations are performed in one atomic transaction. The function returns old.

template<typename T> T atomicExch(T* address, T val);

reads the 128-bit word old located at the address address in global or shared memory and stores val back to memory at the same address. These two operations are performed in one atomic transaction. The function returns old. The type T must meet the following requirements:

sizeof(T) == 16
alignof(T) >= 16
std::is_trivially_copyable<T>::value == true
// for C++03 and older
std::is_default_constructible<T>::value == true

So, T must be 128-bit and properly aligned, be trivially copyable, and on C++03 or older, it must also be default constructible.
因此, T 必须是 128 位且正确对齐,可以被简单地复制,并且在 C++03 或更早版本中,它还必须是默认可构造的。

The 128-bit atomicExch() is only supported by devices of compute capability 9.x and higher.
128 位 atomicExch() 仅受到计算能力为 9.x 及更高的设备支持。

7.14.1.4. atomicMin()

int atomicMin(int* address, int val);
unsigned int atomicMin(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicMin(unsigned long long int* address,
                                 unsigned long long int val);
long long int atomicMin(long long int* address,
                                long long int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes the minimum of old and val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位或 64 位字 old ,计算 oldval 的最小值,并将结果存储回内存相同地址。这三个操作在一个原子事务中执行。函数返回 old

The 64-bit version of atomicMin() is only supported by devices of compute capability 5.0 and higher.
atomicMin() 的 64 位版本仅受支持于计算能力为 5.0 及更高的设备。

7.14.1.5. atomicMax()

int atomicMax(int* address, int val);
unsigned int atomicMax(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicMax(unsigned long long int* address,
                                 unsigned long long int val);
long long int atomicMax(long long int* address,
                                 long long int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes the maximum of old and val, and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位或 64 位字 old ,计算 oldval 的最大值,并将结果存储回内存的同一地址。这三个操作在一个原子事务中执行。该函数返回 old

The 64-bit version of atomicMax() is only supported by devices of compute capability 5.0 and higher.
atomicMax() 的 64 位版本仅受支持于计算能力为 5.0 及更高的设备。

7.14.1.6. atomicInc()

unsigned int atomicInc(unsigned int* address,
                       unsigned int val);

reads the 32-bit word old located at the address address in global or shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位字 old ,计算 ((old >= val) ? 0 : (old+1)) ,并将结果存储回内存到相同地址。这三个操作在一个原子事务中执行。该函数返回 old

7.14.1.7. atomicDec()

unsigned int atomicDec(unsigned int* address,
                       unsigned int val);

reads the 32-bit word old located at the address address in global or shared memory, computes (((old == 0) || (old > val)) ? val : (old-1)), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位字 old ,计算 (((old == 0) || (old > val)) ? val : (old-1)) ,并将结果存储回相同地址的内存。这三个操作在一个原子事务中执行。该函数返回 old
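The wrap-to-zero behavior of atomicInc() makes it a natural fit for claiming slots in a ring buffer. The sketch below (buffer size N, names writeIndex and ringBuffer are illustrative, not CUDA APIs) lets each thread claim a distinct slot modulo N:

```cuda
#define N 256

__device__ unsigned int writeIndex;   // assumed zero-initialized
__device__ float ringBuffer[N];

__global__ void produce(const float *src) {
    // atomicInc returns the old value and wraps the stored counter to 0
    // once it reaches N-1, so the returned slot cycles through 0..N-1.
    unsigned int slot = atomicInc(&writeIndex, N - 1);
    ringBuffer[slot] = src[blockIdx.x * blockDim.x + threadIdx.x];
}
```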

7.14.1.8. atomicCAS()

int atomicCAS(int* address, int compare, int val);
unsigned int atomicCAS(unsigned int* address,
                       unsigned int compare,
                       unsigned int val);
unsigned long long int atomicCAS(unsigned long long int* address,
                                 unsigned long long int compare,
                                 unsigned long long int val);
unsigned short int atomicCAS(unsigned short int *address,
                             unsigned short int compare,
                             unsigned short int val);

reads the 16-bit, 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap).
读取全局或共享内存中地址 address 处的 16 位、32 位或 64 位字 old ,计算 (old == compare ? val : old) ,并将结果存储回内存的同一地址。这三个操作在一个原子事务中执行。该函数返回 old (比较并交换)。

template<typename T> T atomicCAS(T* address, T compare, T val);

reads the 128-bit word old located at the address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap). The type T must meet the following requirements:
读取全局或共享内存中地址为 address 处的 128 位字 old ,计算 (old == compare ? val : old) ,并将结果存储回内存的同一地址。这三个操作在一个原子事务中执行。该函数返回 old (比较并交换)。类型 T 必须满足以下要求:

sizeof(T) == 16
alignof(T) >= 16
std::is_trivially_copyable<T>::value == true
// for C++03 and older
std::is_default_constructible<T>::value == true

So, T must be 128-bit and properly aligned, be trivially copyable, and on C++03 or older, it must also be default constructible.
因此, T 必须是 128 位且正确对齐,可以被简单地复制,并且在 C++03 或更早版本中,它还必须是默认可构造的。

The 128-bit atomicCAS() is only supported by devices of compute capability 9.x and higher.
128 位 atomicCAS() 仅受到计算能力为 9.x 及更高的设备支持。
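atomicCAS() is the building block for atomic operations the hardware does not provide directly. The sketch below implements an atomic multiply for double by retrying the compare-and-swap until no other thread has intervened; atomicMul itself is not a CUDA built-in:

```cuda
__device__ double atomicMul(double *address, double val) {
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits as double, multiply, and attempt to swap the
        // result in; atomicCAS returns the value actually found at address.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val * __longlong_as_double(assumed)));
    } while (assumed != old);  // another thread updated the value: retry
    return __longlong_as_double(old);
}
```

The same read-modify-CAS loop pattern works for any operation on a 32-bit, 64-bit, or (on compute capability 9.x+) 128-bit word.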

7.14.2. Bitwise Functions
7.14.2. 位运算函数 

7.14.2.1. atomicAnd()

int atomicAnd(int* address, int val);
unsigned int atomicAnd(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicAnd(unsigned long long int* address,
                                 unsigned long long int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old & val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位或 64 位字 old ,计算 (old & val) ,并将结果存储回内存的相同地址。这三个操作在一个原子事务中执行。该函数返回 old

The 64-bit version of atomicAnd() is only supported by devices of compute capability 5.0 and higher.
atomicAnd() 的 64 位版本仅受支持于计算能力为 5.0 及更高的设备。

7.14.2.2. atomicOr()

int atomicOr(int* address, int val);
unsigned int atomicOr(unsigned int* address,
                      unsigned int val);
unsigned long long int atomicOr(unsigned long long int* address,
                                unsigned long long int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old | val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位或 64 位字 old ,计算 (old | val) ,并将结果存储回内存到相同地址。这三个操作在一个原子事务中执行。该函数返回 old

The 64-bit version of atomicOr() is only supported by devices of compute capability 5.0 and higher.
atomicOr() 的 64 位版本仅受支持于计算能力为 5.0 及更高的设备。

7.14.2.3. atomicXor()

int atomicXor(int* address, int val);
unsigned int atomicXor(unsigned int* address,
                       unsigned int val);
unsigned long long int atomicXor(unsigned long long int* address,
                                 unsigned long long int val);

reads the 32-bit or 64-bit word old located at the address address in global or shared memory, computes (old ^ val), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old.
读取全局或共享内存中地址 address 处的 32 位或 64 位字 old ,计算 (old ^ val) ,并将结果存储回内存到相同地址。这三个操作在一个原子事务中执行。该函数返回 old

The 64-bit version of atomicXor() is only supported by devices of compute capability 5.0 and higher.
atomicXor() 的 64 位版本仅受支持于计算能力为 5.0 及更高的设备。

7.15. Address Space Predicate Functions
7.15. 地址空间谓词函数 

The functions described in this section have unspecified behavior if the argument is a null pointer.
如果参数是空指针,则本节中描述的函数具有未指定的行为。

7.15.1. __isGlobal()

__device__ unsigned int __isGlobal(const void *ptr);

Returns 1 if ptr contains the generic address of an object in global memory space, otherwise returns 0.
如果 ptr 包含全局内存空间中对象的通用地址,则返回 1,否则返回 0。

7.15.2. __isShared()

__device__ unsigned int __isShared(const void *ptr);

Returns 1 if ptr contains the generic address of an object in shared memory space, otherwise returns 0.
如果 ptr 包含共享内存空间中对象的通用地址,则返回 1,否则返回 0。

7.15.3. __isConstant()

__device__ unsigned int __isConstant(const void *ptr);

Returns 1 if ptr contains the generic address of an object in constant memory space, otherwise returns 0.
如果 ptr 包含常量内存空间中对象的通用地址,则返回 1,否则返回 0。

7.15.4. __isGridConstant()

__device__ unsigned int __isGridConstant(const void *ptr);

Returns 1 if ptr contains the generic address of a kernel parameter annotated with __grid_constant__, otherwise returns 0. Only supported by devices of compute capability 7.x and higher.
如果 ptr 包含用 __grid_constant__ 注释的内核参数的通用地址,则返回 1,否则返回 0。仅受计算能力为 7.x 及更高的设备支持。

7.15.5. __isLocal()

__device__ unsigned int __isLocal(const void *ptr);

Returns 1 if ptr contains the generic address of an object in local memory space, otherwise returns 0.
如果 ptr 包含本地内存空间中对象的通用地址,则返回 1,否则返回 0。
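A minimal sketch that uses the address space predicates to report where a generic pointer actually points; the function name describe is illustrative, not a CUDA API:

```cuda
#include <cstdio>

__device__ void describe(const void *ptr) {
    if (__isGlobal(ptr))        printf("global\n");
    else if (__isShared(ptr))   printf("shared\n");
    else if (__isConstant(ptr)) printf("constant\n");
    else if (__isLocal(ptr))    printf("local\n");
}

__constant__ int c;
__device__   int g;

__global__ void kernel() {
    __shared__ int s;
    int l;
    describe(&g);   // global
    describe(&s);   // shared
    describe(&c);   // constant
    describe(&l);   // local (taking the address forces l into local memory)
}
```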

7.16. Address Space Conversion Functions
7.16. 地址空间转换函数 

7.16.1. __cvta_generic_to_global()

__device__ size_t __cvta_generic_to_global(const void *ptr);

Returns the result of executing the PTX cvta.to.global instruction on the generic address denoted by ptr.
返回在由 ptr 表示的通用地址上执行 PTX cvta.to.global 指令的结果。

7.16.2. __cvta_generic_to_shared()

__device__ size_t __cvta_generic_to_shared(const void *ptr);

Returns the result of executing the PTX cvta.to.shared instruction on the generic address denoted by ptr.
返回在由 ptr 表示的通用地址上执行 PTX cvta.to.shared 指令的结果。

7.16.3. __cvta_generic_to_constant()

__device__ size_t __cvta_generic_to_constant(const void *ptr);

Returns the result of executing the PTX cvta.to.const instruction on the generic address denoted by ptr.
返回在由 ptr 表示的通用地址上执行 PTX cvta.to.const 指令的结果。

7.16.4. __cvta_generic_to_local()

__device__ size_t __cvta_generic_to_local(const void *ptr);

Returns the result of executing the PTX cvta.to.local instruction on the generic address denoted by ptr.
返回在由 ptr 表示的通用地址上执行 PTX cvta.to.local 指令的结果。

7.16.5. __cvta_global_to_generic()

__device__ void * __cvta_global_to_generic(size_t rawbits);

Returns the generic pointer obtained by executing the PTX cvta.global instruction on the value provided by rawbits.
返回通过在由 rawbits 提供的值上执行 PTX cvta.global 指令获得的通用指针。

7.16.6. __cvta_shared_to_generic()

__device__ void * __cvta_shared_to_generic(size_t rawbits);

Returns the generic pointer obtained by executing the PTX cvta.shared instruction on the value provided by rawbits.
返回通过在由 rawbits 提供的值上执行 PTX cvta.shared 指令获得的通用指针。

7.16.7. __cvta_constant_to_generic()

__device__ void * __cvta_constant_to_generic(size_t rawbits);

Returns the generic pointer obtained by executing the PTX cvta.const instruction on the value provided by rawbits.
返回通过在由 rawbits 提供的值上执行 PTX cvta.const 指令获得的通用指针。

7.16.8. __cvta_local_to_generic()

__device__ void * __cvta_local_to_generic(size_t rawbits);

Returns the generic pointer obtained by executing the PTX cvta.local instruction on the value provided by rawbits.
返回通过在由 rawbits 提供的值上执行 PTX cvta.local 指令获得的通用指针。

7.17. Alloca Function
7.17. Alloca 函数 

7.17.1. Synopsis 7.17.1. 概要 

__host__ __device__ void * alloca(size_t size);

7.17.2. Description 7.17.2. 描述 

The alloca() function allocates size bytes of memory in the stack frame of the caller. The returned value is a pointer to the allocated memory; the beginning of the memory is 16-byte aligned when the function is invoked from device code. The allocated memory is automatically freed when the function that called alloca() returns.
alloca() 函数在调用者的堆栈帧中分配 size 字节的内存。返回值是指向所分配内存的指针;当从设备代码调用该函数时,内存的起始位置按 16 字节对齐。当调用 alloca() 的函数返回时,所分配的内存会自动释放。

Note 注意

On the Windows platform, <malloc.h> must be included before using alloca(). Using alloca() may cause the stack to overflow, so the user may need to adjust the stack size accordingly.
在 Windows 平台上,使用 alloca() 之前必须包含 <malloc.h> 。使用 alloca() 可能会导致堆栈溢出,用户可能需要相应地调整堆栈大小。

It is supported with compute capability 5.2 or higher.
它支持计算能力为 5.2 或更高的设备。

7.17.3. Example 7.17.3. 示例 

__device__ void foo(unsigned int num) {
    int4 *ptr = (int4 *)alloca(num * sizeof(int4));
    // use of ptr
    ...
}

7.18. Compiler Optimization Hint Functions
7.18. 编译器优化提示函数 

The functions described in this section can be used to provide additional information to the compiler optimizer.
本节中描述的函数可用于向编译器优化器提供额外信息。

7.18.1. __builtin_assume_aligned()

void * __builtin_assume_aligned (const void *exp, size_t align)

Allows the compiler to assume that the argument pointer is aligned to at least align bytes, and returns the argument pointer.
允许编译器假定参数指针至少对齐到 align 字节,并返回参数指针。

Example: 示例:

void *res = __builtin_assume_aligned(ptr, 32); // compiler can assume 'res' is
                                               // at least 32-byte aligned

Three parameter version:
三个参数版本:

void * __builtin_assume_aligned (const void *exp, size_t align,
                                 <integral type> offset)

Allows the compiler to assume that (char *)exp - offset is aligned to at least align bytes, and returns the argument pointer.
允许编译器假定 (char *)exp - offset 至少对齐到 align 字节,并返回参数指针。

Example: 示例:

void *res = __builtin_assume_aligned(ptr, 32, 8); // compiler can assume
                                                  // '(char *)res - 8' is
                                                  // at least 32-byte aligned.

7.18.2. __builtin_assume()

void __builtin_assume(bool exp)

Allows the compiler to assume that the Boolean argument is true. If the argument is not true at run time, then the behavior is undefined. Note that if the argument has side effects, the behavior is unspecified.
允许编译器假定布尔参数为真。如果参数在运行时不为真,则行为是未定义的。请注意,如果参数具有副作用,则行为是未指定的。

Example: 示例:

 __device__ int get(int *ptr, int idx) {
   __builtin_assume(idx <= 2);
   return ptr[idx];
}

7.18.3. __assume()

void __assume(bool exp)

Allows the compiler to assume that the Boolean argument is true. If the argument is not true at run time, then the behavior is undefined. Note that if the argument has side effects, the behavior is unspecified.
允许编译器假定布尔参数为真。如果参数在运行时不为真,则行为是未定义的。请注意,如果参数具有副作用,则行为是未指定的。

Example: 示例:

 __device__ int get(int *ptr, int idx) {
   __assume(idx <= 2);
   return ptr[idx];
}

7.18.4. __builtin_expect()

long __builtin_expect (long exp, long c)

Indicates to the compiler that it is expected that exp == c, and returns the value of exp. Typically used to indicate branch prediction information to the compiler.
指示编译器预期 exp == c ,并返回 exp 的值。通常用于向编译器指示分支预测信息。

Example: 示例:

// indicate to the compiler that likely "var == 0",
// so the body of the if-block is unlikely to be
// executed at run time.
if (__builtin_expect (var, 0))
  doit ();

7.18.5. __builtin_unreachable()

void __builtin_unreachable(void)

Indicates to the compiler that control flow never reaches the point where this function is being called from. The program has undefined behavior if the control flow does actually reach this point at run time.
指示编译器控制流永远不会到达调用此函数的地方。如果控制流实际上在运行时到达此点,程序的行为是未定义的。

Example: 示例:

// indicates to the compiler that the default case label is never reached.
switch (in) {
case 1: return 4;
case 2: return 10;
default: __builtin_unreachable();
}

7.18.6. Restrictions 7.18.6. 限制 

__assume() is only supported when using cl.exe host compiler. The other functions are supported on all platforms, subject to the following restrictions:
__assume() 仅在使用 cl.exe 主机编译器时支持。其他功能在所有平台上都受支持,但受以下限制:

  • If the host compiler supports the function, the function can be invoked from anywhere in translation unit.
    如果主机编译器支持该函数,则可以从翻译单元的任何位置调用该函数。

  • Otherwise, the function must be invoked from within the body of a __device__/ __global__function, or only when the __CUDA_ARCH__ macro is defined12.
    否则,该函数必须在 __device__ / __global__ 函数的主体内调用,或者仅当定义了 __CUDA_ARCH__ 宏时才能调用。

7.19. Warp Vote Functions
7.19. Warp 投票功能 

int __all_sync(unsigned mask, int predicate);
int __any_sync(unsigned mask, int predicate);
unsigned __ballot_sync(unsigned mask, int predicate);
unsigned __activemask();

Deprecation notice: __any, __all, and __ballot have been deprecated in CUDA 9.0 for all devices.
弃用通知:CUDA 9.0 中已弃用所有设备的 __any__all__ballot

Removal notice: When targeting devices with compute capability 7.x or higher, __any, __all, and __ballot are no longer available and their sync variants should be used instead.
移除通知:当针对计算能力为 7.x 或更高的设备时, __any__all__ballot 不再可用,应改用它们的同步变体。

The warp vote functions allow the threads of a given warp to perform a reduction-and-broadcast operation. These functions take as input an integer predicate from each thread in the warp and compare those values with zero. The results of the comparisons are combined (reduced) across the active threads of the warp in one of the following ways, broadcasting a single return value to each participating thread:
warp 投票函数允许给定 warp 的线程执行归约和广播操作。这些函数将来自 warp 中每个线程的整数 predicate 作为输入,并将这些值与零进行比较。比较的结果以以下一种方式在 warp 的活动线程中合并(归约),向每个参与线程广播单个返回值:

__all_sync(unsigned mask, predicate):

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.
mask 中所有未退出线程评估 predicate ,仅当所有线程的 predicate 评估为非零时返回非零。

__any_sync(unsigned mask, predicate):

Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.
mask 中所有未退出线程评估 predicate ,仅当其中任何一个评估为非零时返回非零。

__ballot_sync(unsigned mask, predicate):

Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.
mask 中所有非退出线程评估 predicate ,并返回一个整数,仅当第 N 个线程处于活动状态且 predicate 对 warp 的第 N 个线程评估为非零时,第 N 位设置。

__activemask():

Returns a 32-bit integer mask of all currently active threads in the calling warp. The Nth bit is set if the Nth lane in the warp is active when __activemask() is called. Inactive threads are represented by 0 bits in the returned mask. Threads which have exited the program are always marked as inactive. Note that threads that are convergent at an __activemask() call are not guaranteed to be convergent at subsequent instructions unless those instructions are synchronizing warp-builtin functions.
返回调用 warp 中所有当前活动线程的 32 位整数掩码。当调用 __activemask() 时,第 N 位设置为 1 表示 warp 中第 N 个 lane 处于活动状态。返回掩码中用 0 位表示非活动线程。已退出程序的线程始终被标记为非活动。请注意,调用 __activemask() 时收敛的线程不能保证在后续指令中保持收敛,除非这些指令是同步的 warp 内置函数。

For __all_sync, __any_sync, and __ballot_sync, a mask must be passed that specifies the threads participating in the call. A bit, representing the thread’s lane ID, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
对于 __all_sync__any_sync__ballot_sync ,必须传递一个指定参与调用的线程的掩码。必须为每个参与的线程设置一个位,表示线程的 lane ID,以确保它们在硬件执行内部函数之前正确汇聚。每个调用线程必须在掩码中设置自己的位,并且所有在掩码中命名的未退出线程必须使用相同的掩码执行相同的内部函数,否则结果是未定义的。

These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.
这些内置函数不意味着内存屏障。它们不保证任何内存排序。
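A common __ballot_sync() idiom is counting, within a full warp, how many threads satisfy a predicate, using __popc() to count the set bits of the ballot. The kernel and array names below are illustrative, not CUDA APIs:

```cuda
__global__ void countPositive(const int *data, int *warpCounts) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int v = data[tid];
    // All 32 lanes participate; lane N's bit is set iff its value is > 0.
    unsigned ballot = __ballot_sync(0xffffffff, v > 0);
    // Every lane receives the same ballot, so lane 0 can record the count.
    if ((threadIdx.x & 0x1f) == 0)
        warpCounts[tid >> 5] = __popc(ballot);
}
```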

7.20. Warp Match Functions
7.20. Warp 匹配函数 

__match_any_sync and __match_all_sync perform a broadcast-and-compare operation of a variable between threads within a warp.
__match_any_sync__match_all_sync 在 warp 内的线程之间执行变量的广播比较操作。

Supported by devices of compute capability 7.x or higher.
支持计算能力为 7.x 或更高的设备。

7.20.1. Synopsis 7.20.1. 概要 

unsigned int __match_any_sync(unsigned mask, T value);
unsigned int __match_all_sync(unsigned mask, T value, int *pred);

T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double.
T 可以是 intunsigned intlongunsigned longlong longunsigned long longfloatdouble

7.20.2. Description 7.20.2. 描述 

The __match_sync() intrinsics permit a broadcast-and-compare of a value value across threads in a warp after synchronizing threads named in mask.
__match_sync() 内在函数允许在同步 mask 中命名的线程之后,在 warp 内的线程之间广播并比较值 value

__match_any_sync

Returns the mask of threads in mask that have the same value of value
返回 mask 中具有相同 value 值的线程的掩码

__match_all_sync

Returns mask if all threads in mask have the same value for value; otherwise 0 is returned. Predicate pred is set to true if all threads in mask have the same value of value; otherwise the predicate is set to false.
如果 mask 中的所有线程对 value 具有相同值,则返回 mask ;否则返回 0。如果 mask 中的所有线程对 value 具有相同值,则将谓词 pred 设置为 true;否则将谓词设置为 false。

The new *_sync match intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
新的 *_sync 匹配内在函数接受一个掩码,指示参与调用的线程。对于每个参与线程,必须设置一个位,表示线程的 lane id,以确保在硬件执行内在函数之前它们被正确汇聚。每个调用线程在掩码中必须设置自己的位,并且所有在掩码中命名的未退出线程必须使用相同的掩码执行相同的内在函数,否则结果是未定义的。

These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.
这些内置函数不意味着内存屏障。它们不保证任何内存排序。
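A sketch using __match_any_sync() to group the lanes of a warp by key: each thread learns the mask of lanes holding the same key, and the lowest such lane can be elected leader (for example, to perform one update per group). The names groupByKey, keys, and leaders are illustrative, not CUDA APIs:

```cuda
__global__ void groupByKey(const int *keys, int *leaders) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int key = keys[tid];
    // peers has a bit set for every lane in the warp holding the same key.
    unsigned peers = __match_any_sync(0xffffffff, key);
    // __ffs returns the 1-based position of the lowest set bit, so the
    // lowest lane in the group becomes the leader.
    int leaderLane = __ffs(peers) - 1;
    leaders[tid] = leaderLane;
}
```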

7.21. Warp Reduce Functions
7.21. Warp Reduce Functions  7.21. 绕组减少函数 

The __reduce_sync(unsigned mask, T value) intrinsics perform a reduction operation on the data provided in value after synchronizing threads named in mask. T can be unsigned or signed for {add, min, max} and unsigned only for {and, or, xor} operations.
__reduce_sync(unsigned mask, T value) 内在函数在同步 mask 中命名的线程后,对 value 中提供的数据执行归约操作。对于 {add, min, max} 操作,T 可以是无符号或有符号类型;对于 {and, or, xor} 操作,T 只能是无符号类型。

Supported by devices of compute capability 8.x or higher.
支持计算能力为 8.x 或更高的设备。

7.21.1. Synopsis 7.21.1. 概要 

// add/min/max
unsigned __reduce_add_sync(unsigned mask, unsigned value);
unsigned __reduce_min_sync(unsigned mask, unsigned value);
unsigned __reduce_max_sync(unsigned mask, unsigned value);
int __reduce_add_sync(unsigned mask, int value);
int __reduce_min_sync(unsigned mask, int value);
int __reduce_max_sync(unsigned mask, int value);

// and/or/xor
unsigned __reduce_and_sync(unsigned mask, unsigned value);
unsigned __reduce_or_sync(unsigned mask, unsigned value);
unsigned __reduce_xor_sync(unsigned mask, unsigned value);

7.21.2. Description 7.21.2. 描述 

__reduce_add_sync, __reduce_min_sync, __reduce_max_sync

Returns the result of applying an arithmetic add, min, or max reduction operation on the values provided in value by each thread named in mask.
返回在 mask 中命名的每个线程提供的 value 值上应用算术加法、最小值或最大值缩减操作的结果。

__reduce_and_sync, __reduce_or_sync, __reduce_xor_sync

Returns the result of applying a logical AND, OR, or XOR reduction operation on the values provided in value by each thread named in mask.
返回在 mask 中命名的每个线程提供的值上应用逻辑 AND、OR 或 XOR 缩减操作的结果。

The mask indicates the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
mask 表示参与调用的线程。每个参与线程都必须设置一个位,表示线程的 lane id,以确保在硬件执行内部函数之前它们能够正确汇聚。每个调用线程在掩码中都必须设置自己的位,并且所有在掩码中命名的未退出线程必须使用相同的掩码执行相同的内部函数,否则结果是未定义的。

These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.
这些内置函数不意味着内存屏障。它们不保证任何内存排序。
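On devices of compute capability 8.x and higher, a single __reduce_add_sync() call can replace the shuffle-based butterfly reduction shown later in the Warp Shuffle examples; every participating lane receives the warp-wide result. The kernel and array names below are illustrative, not CUDA APIs:

```cuda
__global__ void warpSum(const unsigned *data, unsigned *sums) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Every lane in the full warp contributes its element and receives
    // the sum across all 32 lanes.
    unsigned total = __reduce_add_sync(0xffffffff, data[tid]);
    // One lane per warp writes the result out.
    if ((threadIdx.x & 0x1f) == 0)
        sums[tid >> 5] = total;
}
```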

7.22. Warp Shuffle Functions
7.22. Warp Shuffle 函数 

__shfl_sync, __shfl_up_sync, __shfl_down_sync, and __shfl_xor_sync exchange a variable between threads within a warp.
__shfl_sync__shfl_up_sync__shfl_down_sync__shfl_xor_sync 在 warp 内的线程之间交换变量。

Supported by devices of compute capability 5.0 or higher.
支持计算能力为 5.0 或更高的设备。

Deprecation Notice: __shfl, __shfl_up, __shfl_down, and __shfl_xor have been deprecated in CUDA 9.0 for all devices.
弃用通知:CUDA 9.0 已弃用所有设备上的 __shfl__shfl_up__shfl_down__shfl_xor

Removal Notice: When targeting devices with compute capability 7.x or higher, __shfl, __shfl_up, __shfl_down, and __shfl_xor are no longer available and their sync variants should be used instead.
移除通知:当针对计算能力为 7.x 或更高的设备时, __shfl__shfl_up__shfl_down__shfl_xor 不再可用,应改用它们的同步变体。

7.22.1. Synopsis 7.22.1. 概要 

T __shfl_sync(unsigned mask, T var, int srcLane, int width=warpSize);
T __shfl_up_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_down_sync(unsigned mask, T var, unsigned int delta, int width=warpSize);
T __shfl_xor_sync(unsigned mask, T var, int laneMask, int width=warpSize);

T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float or double. With the cuda_fp16.h header included, T can also be __half or __half2. Similarly, with the cuda_bf16.h header included, T can also be __nv_bfloat16 or __nv_bfloat162.
T 可以是 intunsigned intlongunsigned longlong longunsigned long longfloatdouble 。包含 cuda_fp16.h 头部时, T 也可以是 __half__half2 。类似地,包含 cuda_bf16.h 头部时, T 也可以是 __nv_bfloat16__nv_bfloat162

7.22.2. Description 7.22.2. 描述 

The __shfl_sync() intrinsics permit exchanging of a variable between threads within a warp without use of shared memory. The exchange occurs simultaneously for all active threads within the warp (and named in mask), moving 4 or 8 bytes of data per thread depending on the type.
__shfl_sync() 内在函数允许在一个 warp 内的线程之间交换变量,而无需使用共享内存。交换同时发生在 warp 内的所有活动线程(并在 mask 中命名),每个线程根据类型移动 4 或 8 字节的数据。

Threads within a warp are referred to as lanes, and may have an index between 0 and warpSize-1 (inclusive). Four source-lane addressing modes are supported:
warp 内的线程被称为 lane,其索引介于 0 和 warpSize-1 (含)之间。支持四种源 lane 寻址模式:

__shfl_sync()

Direct copy from indexed lane
直接从按索引指定的 lane 复制

__shfl_up_sync()

Copy from a lane with lower ID relative to caller
从相对于调用者 ID 较低的 lane 复制

__shfl_down_sync()

Copy from a lane with higher ID relative to caller
从相对于调用者 ID 较高的 lane 复制

__shfl_xor_sync()

Copy from a lane based on bitwise XOR of own lane ID
基于自身 lane ID 的按位异或,从对应的 lane 复制

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
线程只能从另一个正在参与 __shfl_sync() 命令的线程中读取数据。如果目标线程处于非活动状态,则检索到的值是未定义的。

All of the __shfl_sync() intrinsics take an optional width parameter which alters the behavior of the intrinsic. width must have a value which is a power of two in the range [1, warpSize] (i.e., 1, 2, 4, 8, 16 or 32). Results are undefined for other values.
所有 __shfl_sync() 内在函数都接受一个可选的 width 参数,该参数会改变内在函数的行为。 width 必须是范围为[1, warpSize]内的 2 的幂值(即 1, 2, 4, 8, 16 或 32)。对于其他值,结果是未定义的。

__shfl_sync() returns the value of var held by the thread whose ID is given by srcLane. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. If srcLane is outside the range [0:width-1], the value returned corresponds to the value of var held by the srcLane modulo width (i.e. within the same subsection).
__shfl_sync() 返回 ID 为 srcLane 的线程所持有的 var 的值。如果 width 小于 warpSize ,则 warp 的每个子部分都作为一个独立实体,起始逻辑 lane ID 为 0。如果 srcLane 超出范围 [0:width-1] ,则返回的值对应于 srcLanewidth 取模后的 lane 所持有的 var 的值(即在同一子部分内)。

__shfl_up_sync() calculates a source lane ID by subtracting delta from the caller’s lane ID. The value of var held by the resulting lane ID is returned: in effect, var is shifted up the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. The source lane index will not wrap around the value of width, so effectively the lower delta lanes will be unchanged.
__shfl_up_sync() 通过从调用者的 lane ID 中减去 delta 来计算源 lane ID,并返回该 lane 所持有的 var 的值:实际上, var 在 warp 中向上移动了 delta 个 lane。如果 width 小于 warpSize ,则 warp 的每个子部分都作为一个独立实体,起始逻辑 lane ID 为 0。源 lane 索引不会绕回 width 的值,因此较低的 delta 个 lane 将保持不变。

__shfl_down_sync() calculates a source lane ID by adding delta to the caller’s lane ID. The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If width is less than warpSize then each subsection of the warp behaves as a separate entity with a starting logical lane ID of 0. As for __shfl_up_sync(), the ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged.
__shfl_down_sync() 通过将 delta 加到调用者的 lane ID 来计算源 lane ID,并返回该 lane 所持有的 var 的值:其效果是将 var 在 warp 中向下移动 delta 个 lane。如果 width 小于 warpSize ,则 warp 的每个子部分都作为一个独立实体,起始逻辑 lane ID 为 0。与 __shfl_up_sync() 一样,源 lane 的 ID 不会绕回 width 的值,因此较高的 delta 个 lane 将保持不变。

__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
__shfl_xor_sync() 通过对调用者的 lane ID 与 laneMask 执行按位异或来计算源 lane ID:返回该 lane 所持有的 var 的值。如果 width 小于 warpSize ,则每组 width 个连续线程能够访问较早线程组中的元素,但如果它们尝试访问较晚线程组中的元素,则将返回它们自己的 var 值。此模式实现了蝶形寻址模式,例如用于树形归约和广播。

The new *_sync shfl intrinsics take in a mask indicating the threads participating in the call. A bit, representing the thread’s lane id, must be set for each participating thread to ensure they are properly converged before the intrinsic is executed by the hardware. Each calling thread must have its own bit set in the mask and all non-exited threads named in mask must execute the same intrinsic with the same mask, or the result is undefined.
新的 *_sync shfl 内在函数接受一个掩码,指示参与调用的线程。对于每个参与的线程,必须设置一个位,表示线程的 lane id,以确保在硬件执行内在函数之前它们被正确汇聚。每个调用线程在掩码中必须设置自己的位,并且所有在掩码中命名的未退出线程必须使用相同的掩码执行相同的内在函数,否则结果是未定义的。

Threads may only read data from another thread which is actively participating in the __shfl_sync() command. If the target thread is inactive, the retrieved value is undefined.
线程只能从另一个正在参与 __shfl_sync() 命令的线程中读取数据。如果目标线程处于非活动状态,则检索到的值是未定义的。

These intrinsics do not imply a memory barrier. They do not guarantee any memory ordering.
这些内置函数不意味着内存屏障。它们不保证任何内存排序。

7.22.3. Examples 7.22.3. 示例 

7.22.3.1. Broadcast of a single value across a warp
7.22.3.1. 在 warp 中广播单个值 

#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)        // Note unused variable for
        value = arg;        // all threads except lane 0
    value = __shfl_sync(0xffffffff, value, 0);   // Synchronize all threads in warp, and get "value" from lane 0
    if (value != arg)
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<< 1, 32 >>>(1234);
    cudaDeviceSynchronize();

    return 0;
}

7.22.3.2. Inclusive plus-scan across sub-partitions of 8 threads
7.22.3.2. 在 8 个线程的子分区上进行包含式加法扫描 

#include <stdio.h>

__global__ void scan4() {
    int laneId = threadIdx.x & 0x1f;
    // Seed sample starting value (inverse of lane ID)
    int value = 31 - laneId;

    // Loop to accumulate scan within my partition.
    // Scan requires log2(n) == 3 steps for 8 threads
    // It works by an accumulated sum up the warp
    // by 1, 2, 4, 8 etc. steps.
    for (int i=1; i<=4; i*=2) {
        // We do the __shfl_sync unconditionally so that we
        // can read even from threads which won't do a
        // sum, and then conditionally assign the result.
        int n = __shfl_up_sync(0xffffffff, value, i, 8);
        if ((laneId & 7) >= i)
            value += n;
    }

    printf("Thread %d final value = %d\n", threadIdx.x, value);
}

int main() {
    scan4<<< 1, 32 >>>();
    cudaDeviceSynchronize();

    return 0;
}

7.22.3.3. Reduction across a warp

#include <stdio.h>

__global__ void warpReduce() {
    int laneId = threadIdx.x & 0x1f;
    // Seed starting value as inverse lane ID
    int value = 31 - laneId;

    // Use XOR mode to perform butterfly reduction
    for (int i=16; i>=1; i/=2)
        value += __shfl_xor_sync(0xffffffff, value, i, 32);

    // "value" now contains the sum across all threads
    printf("Thread %d final value = %d\n", threadIdx.x, value);
}

int main() {
    warpReduce<<< 1, 32 >>>();
    cudaDeviceSynchronize();

    return 0;
}

7.23. Nanosleep Function

7.23.1. Synopsis

void __nanosleep(unsigned ns);

7.23.2. Description

__nanosleep(ns) suspends the thread for a sleep duration of approximately ns nanoseconds. The maximum sleep duration is approximately 1 millisecond.

It is supported on devices of compute capability 7.0 or higher.

7.23.3. Example

The following code implements a mutex with exponential back-off.

__device__ void mutex_lock(unsigned int *mutex) {
    unsigned int ns = 8;
    while (atomicCAS(mutex, 0, 1) == 1) {
        __nanosleep(ns);
        if (ns < 256) {
            ns *= 2;
        }
    }
}

__device__ void mutex_unlock(unsigned int *mutex) {
    atomicExch(mutex, 0);
}

7.24. Warp Matrix Functions

C++ warp matrix operations leverage Tensor Cores to accelerate matrix problems of the form D=A*B+C. These operations are supported on mixed-precision floating point data for devices of compute capability 7.0 or higher. They require the cooperation of all threads in a warp. In addition, these operations are allowed in conditional code only if the condition evaluates identically across the entire warp; otherwise the code execution is likely to hang.

7.24.1. Description

All following functions and types are defined in the namespace nvcuda::wmma. Sub-byte operations are considered preview, i.e. the data structures and APIs for them are subject to change and may not be compatible with future releases. This extra functionality is defined in the nvcuda::wmma::experimental namespace.

template<typename Use, int m, int n, int k, typename T, typename Layout=void> class fragment;

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm);
void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
void fill_fragment(fragment<...> &a, const T& v);
void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);
fragment

An overloaded class containing a section of a matrix distributed across all threads in the warp. The mapping of matrix elements into fragment internal storage is unspecified and subject to change in future architectures.

Only certain combinations of template arguments are allowed. The first template parameter specifies how the fragment will participate in the matrix operation. Acceptable values for Use are:

  • matrix_a when the fragment is used as the first multiplicand, A,

  • matrix_b when the fragment is used as the second multiplicand, B, or

  • accumulator when the fragment is used as the source or destination accumulators (C or D, respectively).

    The m, n and k sizes describe the shape of the warp-wide matrix tiles participating in the multiply-accumulate operation. The dimension of each tile depends on its role. For matrix_a the tile takes dimension m x k; for matrix_b the dimension is k x n, and accumulator tiles are m x n.

    The data type, T, may be double, float, __half, __nv_bfloat16, char, or unsigned char for multiplicands and double, float, int, or __half for accumulators. As documented in Element Types and Matrix Sizes, limited combinations of accumulator and multiplicand types are supported. The Layout parameter must be specified for matrix_a and matrix_b fragments. row_major or col_major indicate that elements within a matrix row or column are contiguous in memory, respectively. The Layout parameter for an accumulator matrix should retain the default value of void. A row or column layout is specified only when the accumulator is loaded or stored as described below.

load_matrix_sync

Waits until all warp lanes have arrived at load_matrix_sync and then loads the matrix fragment a from memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or multiple of 4 for float element type. (i.e., multiple of 16 bytes in both cases). If the fragment is an accumulator, the layout argument must be specified as either mem_row_major or mem_col_major. For matrix_a and matrix_b fragments, the layout is inferred from the fragment’s layout parameter. The values of mptr, ldm, layout and all template parameters for a must be the same for all threads in the warp. This function must be called by all threads in the warp, or the result is undefined.

store_matrix_sync

Waits until all warp lanes have arrived at store_matrix_sync and then stores the matrix fragment a to memory. mptr must be a 256-bit aligned pointer pointing to the first element of the matrix in memory. ldm describes the stride in elements between consecutive rows (for row major layout) or columns (for column major layout) and must be a multiple of 8 for __half element type or multiple of 4 for float element type. (i.e., multiple of 16 bytes in both cases). The layout of the output matrix must be specified as either mem_row_major or mem_col_major. The values of mptr, ldm, layout and all template parameters for a must be the same for all threads in the warp.

fill_fragment

Fill a matrix fragment with a constant value v. Because the mapping of matrix elements to each fragment is unspecified, this function is ordinarily called by all threads in the warp with a common value for v.

mma_sync

Waits until all warp lanes have arrived at mma_sync, and then performs the warp-synchronous matrix multiply-accumulate operation D=A*B+C. The in-place operation, C=A*B+C, is also supported. The value of satf and template parameters for each matrix fragment must be the same for all threads in the warp. Also, the template parameters m, n and k must match between fragments A, B, C and D. This function must be called by all threads in the warp, or the result is undefined.

If satf (saturate to finite value) mode is true, the following additional numerical properties apply for the destination accumulator:

  • If an element result is +Infinity, the corresponding accumulator will contain +MAX_NORM

  • If an element result is -Infinity, the corresponding accumulator will contain -MAX_NORM

  • If an element result is NaN, the corresponding accumulator will contain +0

Because the map of matrix elements into each thread’s fragment is unspecified, individual matrix elements must be accessed from memory (shared or global) after calling store_matrix_sync. In the special case where all threads in the warp will apply an element-wise operation uniformly to all fragment elements, direct element access can be implemented using the following fragment class members.

enum fragment<Use, m, n, k, T, Layout>::num_elements;
T fragment<Use, m, n, k, T, Layout>::x[num_elements];

As an example, the following code scales an accumulator matrix tile by half.

wmma::fragment<wmma::accumulator, 16, 16, 16, float> frag;
float alpha = 0.5f; // Same value for all threads in warp
/*...*/
for(int t=0; t<frag.num_elements; t++)
    frag.x[t] *= alpha;

7.24.2. Alternate Floating Point

Tensor Cores support alternate types of floating point operations on devices with compute capability 8.0 and higher.

__nv_bfloat16

This data format is an alternate fp16 format that has the same range as f32 but reduced precision (7 bits). You can use this data format directly with the __nv_bfloat16 type available in cuda_bf16.h. Matrix fragments with __nv_bfloat16 data types are required to be composed with accumulators of float type. The shapes and operations supported are the same as with __half.

tf32

This data format is a special floating point format supported by Tensor Cores, with the same range as f32 and reduced precision (>=10 bits). The internal layout of this format is implementation defined. In order to use this floating point format with WMMA operations, the input matrices must be manually converted to tf32 precision.

To facilitate conversion, a new intrinsic __float_to_tf32 is provided. While the input and output arguments to the intrinsic are of float type, the output will be tf32 numerically. This new precision is intended to be used with Tensor Cores only; if mixed with other float-type operations, the precision and range of the result will be undefined.

Once an input matrix (matrix_a or matrix_b) is converted to tf32 precision, the combination of a fragment with precision::tf32 precision, and a data type of float to load_matrix_sync will take advantage of this new capability. Both the accumulator fragments must have float data types. The only supported matrix size is 16x16x8 (m-n-k).

The elements of the fragment are represented as float, hence the mapping from element_type<T> to storage_element_type<T> is:

precision::tf32 -> float

7.24.3. Double Precision

Tensor Cores support double-precision floating point operations on devices with compute capability 8.0 and higher. To use this new functionality, a fragment with the double type must be used. The mma_sync operation will be performed with the .rn (rounds to nearest even) rounding modifier.

7.24.4. Sub-byte Operations

Sub-byte WMMA operations provide a way to access the low-precision capabilities of Tensor Cores. They are considered a preview feature i.e. the data structures and APIs for them are subject to change and may not be compatible with future releases. This functionality is available via the nvcuda::wmma::experimental namespace:

namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp {
        bmmaBitOpXOR = 1, // compute_75 minimum
        bmmaBitOpAND = 2  // compute_80 minimum
    };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}

For 4-bit precision, the APIs available remain the same, but you must specify experimental::precision::u4 or experimental::precision::s4 as the fragment data type. Since the elements of the fragment are packed together, num_storage_elements will be smaller than num_elements for that fragment. The num_elements variable for a sub-byte fragment therefore returns the number of elements of sub-byte type element_type<T>. This is true for single bit precision as well, in which case the mapping from element_type<T> to storage_element_type<T> is as follows:

experimental::precision::u4 -> unsigned (8 elements in 1 storage element)
experimental::precision::s4 -> int (8 elements in 1 storage element)
experimental::precision::b1 -> unsigned (32 elements in 1 storage element)
T -> T  //all other types

The allowed layout for sub-byte fragments is always row_major for matrix_a and col_major for matrix_b.

For sub-byte operations the value of ldm in load_matrix_sync should be a multiple of 32 for element type experimental::precision::u4 and experimental::precision::s4 or a multiple of 128 for element type experimental::precision::b1 (i.e., multiple of 16 bytes in both cases).

Note

Support for the following variants for MMA instructions is deprecated and will be removed in sm_90:

  • experimental::precision::u4

  • experimental::precision::s4

  • experimental::precision::b1 with bmmaBitOp set to bmmaBitOpXOR

bmma_sync

Waits until all warp lanes have executed bmma_sync, and then performs the warp-synchronous bit matrix multiply-accumulate operation D = (A op B) + C, where op consists of a logical operation bmmaBitOp followed by the accumulation defined by bmmaAccumulateOp. The available operations are:

bmmaBitOpXOR, a 128-bit XOR of a row in matrix_a with the 128-bit column of matrix_b

bmmaBitOpAND, a 128-bit AND of a row in matrix_a with the 128-bit column of matrix_b, available on devices with compute capability 8.0 and higher.

The accumulate op is always bmmaAccumulateOpPOPC which counts the number of set bits.

7.24.5. Restrictions

The special format required by tensor cores may be different for each major and minor device architecture. This is further complicated by threads holding only a fragment (opaque architecture-specific ABI data structure) of the overall matrix, with the developer not allowed to make assumptions on how the individual parameters are mapped to the registers participating in the matrix multiply-accumulate.

Since fragments are architecture-specific, it is unsafe to pass them from function A to function B if the functions have been compiled for different link-compatible architectures and linked together into the same device executable. In this case, the size and layout of the fragment will be specific to one architecture and using WMMA APIs in the other will lead to incorrect results or potentially, corruption.

An example of two link-compatible architectures, where the layout of the fragment differs, is sm_70 and sm_75.

fragA.cu: void foo() { wmma::fragment<...> mat_a; bar(&mat_a); }
fragB.cu: void bar(wmma::fragment<...> *mat_a) { // operate on mat_a }
// sm_70 fragment layout
$> nvcc -dc -arch=compute_70 -code=sm_70 fragA.cu -o fragA.o
// sm_75 fragment layout
$> nvcc -dc -arch=compute_75 -code=sm_75 fragB.cu -o fragB.o
// Linking the two together
$> nvcc -dlink -arch=sm_75 fragA.o fragB.o -o frag.o

This undefined behavior might also be undetectable at compilation time and by tools at runtime, so extra care is needed to make sure the layout of the fragments is consistent. This linking hazard is most likely to appear when linking with a legacy library that is both built for a different link-compatible architecture and expecting to be passed a WMMA fragment.

Note that in the case of weak linkages (for example, a CUDA C++ inline function), the linker may choose any available function definition which may result in implicit passes between compilation units.

To avoid these sorts of problems, the matrix should always be stored out to memory for transit through external interfaces (e.g. wmma::store_matrix_sync(dst, …);) and then it can be safely passed to bar() as a pointer type [e.g. float *dst].

Note that since sm_70 can run on sm_75, the above example sm_75 code can be changed to sm_70 and correctly work on sm_75. However, it is recommended to have sm_75 native code in your application when linking with other sm_75 separately compiled binaries.

7.24.6. Element Types and Matrix Sizes

Tensor Cores support a variety of element types and matrix sizes. The following table presents the various combinations of matrix_a, matrix_b and accumulator matrix supported:

Matrix A         Matrix B         Accumulator   Matrix Size (m-n-k)
__half           __half           float         16x16x16
__half           __half           float         32x8x16
__half           __half           float         8x32x16
__half           __half           __half        16x16x16
__half           __half           __half        32x8x16
__half           __half           __half        8x32x16
unsigned char    unsigned char    int           16x16x16
unsigned char    unsigned char    int           32x8x16
unsigned char    unsigned char    int           8x32x16
signed char      signed char      int           16x16x16
signed char      signed char      int           32x8x16
signed char      signed char      int           8x32x16

Alternate Floating Point support:

Matrix A         Matrix B         Accumulator   Matrix Size (m-n-k)
__nv_bfloat16    __nv_bfloat16    float         16x16x16
__nv_bfloat16    __nv_bfloat16    float         32x8x16
__nv_bfloat16    __nv_bfloat16    float         8x32x16
precision::tf32  precision::tf32  float         16x16x8

Double Precision Support:

Matrix A   Matrix B   Accumulator   Matrix Size (m-n-k)
double     double     double        8x8x4

Experimental support for sub-byte operations:

Matrix A       Matrix B       Accumulator   Matrix Size (m-n-k)
precision::u4  precision::u4  int           8x8x32
precision::s4  precision::s4  int           8x8x32
precision::b1  precision::b1  int           8x8x128

7.24.7. Example

The following code implements a 16x16x16 matrix multiplication in a single warp.

#include <mma.h>
using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

   // Initialize the output to zero
   wmma::fill_fragment(c_frag, 0.0f);

   // Load the inputs
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);

   // Perform the matrix multiplication
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

   // Store the output
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

7.25. DPX

DPX is a set of functions that enable finding minimum and maximum values, as well as fused addition with min/max, for up to three 16- and 32-bit signed or unsigned integer parameters, with optional ReLU (clamping to zero):

  • three parameters: __vimax3_s32, __vimax3_s16x2, __vimax3_u32, __vimax3_u16x2, __vimin3_s32, __vimin3_s16x2, __vimin3_u32, __vimin3_u16x2

  • two parameters, with ReLU: __vimax_s32_relu, __vimax_s16x2_relu, __vimin_s32_relu, __vimin_s16x2_relu

  • three parameters, with ReLU: __vimax3_s32_relu, __vimax3_s16x2_relu, __vimin3_s32_relu, __vimin3_s16x2_relu

  • two parameters, also returning which parameter was smaller/larger: __vibmax_s32, __vibmax_u32, __vibmin_s32, __vibmin_u32, __vibmax_s16x2, __vibmax_u16x2, __vibmin_s16x2, __vibmin_u16x2

  • three parameters, comparing (first + second) with the third: __viaddmax_s32, __viaddmax_s16x2, __viaddmax_u32, __viaddmax_u16x2, __viaddmin_s32, __viaddmin_s16x2, __viaddmin_u32, __viaddmin_u16x2

  • three parameters, with ReLU, comparing (first + second) with the third and a zero: __viaddmax_s32_relu, __viaddmax_s16x2_relu, __viaddmin_s32_relu, __viaddmin_s16x2_relu

These instructions are hardware-accelerated on devices of compute capability 9.0 and higher, and software-emulated on older devices.

The full API can be found in the CUDA Math API documentation.

DPX is exceptionally useful when implementing dynamic programming algorithms, such as Smith-Waterman or Needleman–Wunsch in genomics and Floyd-Warshall in route optimization.

7.25.1. Examples

Max value of three signed 32-bit integers, with ReLU

const int a = -15;
const int b = 8;
const int c = 5;
int max_value_0 = __vimax3_s32_relu(a, b, c); // max(-15, 8, 5, 0) = 8
const int d = -2;
const int e = -4;
int max_value_1 = __vimax3_s32_relu(a, d, e); // max(-15, -2, -4, 0) = 0

Max value of the sum of two 32-bit signed integers, another 32-bit signed integer, and a zero (ReLU)

const int a = -5;
const int b = 6;
const int c = -2;
int max_value_0 = __viaddmax_s32_relu(a, b, c); // max(-5 + 6, -2, 0) = max(1, -2, 0) = 1
const int d = 4;
int max_value_1 = __viaddmax_s32_relu(a, d, c); // max(-5 + 4, -2, 0) = max(-1, -2, 0) = 0

Min value of two unsigned 32-bit integers and determining which value is smaller

const unsigned int a = 9;
const unsigned int b = 6;
bool smaller_value;
unsigned int min_value = __vibmin_u32(a, b, &smaller_value); // min_value is 6, smaller_value is true

Max values of three pairs of unsigned 16-bit integers

const unsigned a = 0x00050002;
const unsigned b = 0x00070004;
const unsigned c = 0x00020006;
unsigned int max_value = __vimax3_u16x2(a, b, c); // max(5, 7, 2) and max(2, 4, 6), so max_value is 0x00070006

7.26. Asynchronous Barrier

The NVIDIA C++ standard library introduces a GPU implementation of std::barrier. Along with the implementation of std::barrier, the library provides extensions that allow users to specify the scope of barrier objects. The barrier API scopes are documented under Thread Scopes. Devices of compute capability 8.0 or higher provide hardware acceleration for barrier operations and integration of these barriers with the memcpy_async feature. On devices of compute capability 7.0 or higher but below 8.0, these barriers are available without hardware acceleration.

nvcuda::experimental::awbarrier is deprecated in favor of cuda::barrier.

7.26.1. Simple Synchronization Pattern

Without the arrive/wait barrier, synchronization is achieved using __syncthreads() (to synchronize all threads in a block) or group.sync() when using Cooperative Groups.

#include <cooperative_groups.h>

__global__ void simple_sync(int iteration_count) {
    auto block = cooperative_groups::this_thread_block();

    for (int i = 0; i < iteration_count; ++i) {
        /* code before arrive */
        block.sync(); /* wait for all threads to arrive here */
        /* code after wait */
    }
}

Threads are blocked at the synchronization point (block.sync()) until all threads have reached the synchronization point. In addition, memory updates that happened before the synchronization point are guaranteed to be visible to all threads in the block after the synchronization point, i.e., equivalent to atomic_thread_fence(memory_order_seq_cst, thread_scope_block) as well as the sync.

This pattern has three stages:

  • Code before sync performs memory updates that will be read after the sync.

  • Synchronization point

  • Code after sync point with visibility of memory updates that happened before sync point.

7.26.2. Temporal Splitting and Five Stages of Synchronization

The temporally-split synchronization pattern with the std::barrier is as follows.

#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ void compute(float* data, int curr_iteration);

__global__ void split_arrive_wait(int iteration_count, float *data) {
    using barrier = cuda::barrier<cuda::thread_scope_block>;
    __shared__  barrier bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0) {
        init(&bar, block.size()); // Initialize the barrier with expected arrival count
    }
    block.sync();

    for (int curr_iter = 0; curr_iter < iteration_count; ++curr_iter) {
        /* code before arrive */
        barrier::arrival_token token = bar.arrive(); /* this thread arrives. Arrival does not block a thread */
        compute(data, curr_iter);
        bar.wait(std::move(token)); /* wait for all threads participating in the barrier to complete bar.arrive()*/
        /* code after wait */
    }
}

In this pattern, the synchronization point (block.sync()) is split into an arrive point (bar.arrive()) and a wait point (bar.wait(std::move(token))). A thread begins participating in a cuda::barrier with its first call to bar.arrive(). When a thread calls bar.wait(std::move(token)) it will be blocked until participating threads have completed bar.arrive() the expected number of times as specified by the expected arrival count argument passed to init(). Memory updates that happen before participating threads’ call to bar.arrive() are guaranteed to be visible to participating threads after their call to bar.wait(std::move(token)). Note that the call to bar.arrive() does not block a thread, it can proceed with other work that does not depend upon memory updates that happen before other participating threads’ call to bar.arrive().

The arrive and then wait pattern has five stages which may be iteratively repeated:

  • Code before arrive performs memory updates that will be read after the wait.

  • Arrive point with implicit memory fence (i.e., equivalent to atomic_thread_fence(memory_order_seq_cst, thread_scope_block)).

  • Code between arrive and wait.

  • Wait point.

  • Code after the wait, with visibility of updates that were performed before the arrive.

7.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation

Initialization must happen before any thread begins participating in a cuda::barrier.

#include <cuda/barrier>
#include <cooperative_groups.h>

__global__ void init_barrier() {
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0) {
        init(&bar, block.size()); // Single thread initializes the total expected arrival count.
    }
    block.sync();
}

Before any thread can participate in a cuda::barrier, the barrier must be initialized using init() with an expected arrival count, block.size() in this example. Initialization must happen before any thread calls bar.arrive(). This poses a bootstrapping challenge: threads must synchronize before participating in the cuda::barrier, but threads are creating the cuda::barrier in order to synchronize. In this example, the threads that will participate are part of a cooperative group and use block.sync() to bootstrap initialization. Since a whole thread block participates in initialization here, __syncthreads() could also be used.

The second parameter of init() is the expected arrival count, i.e., the number of times bar.arrive() will be called by participating threads before a participating thread is unblocked from its call to bar.wait(std::move(token)). In the prior example the cuda::barrier is initialized with the number of threads in the thread block i.e., cooperative_groups::this_thread_block().size(), and all threads within the thread block participate in the barrier.

A cuda::barrier is flexible in specifying how threads participate (split arrive/wait) and which threads participate. In contrast, this_thread_block.sync() from cooperative groups and __syncthreads() apply to a whole thread block, while __syncwarp(mask) applies to a specified subset of a warp. If the intention is to synchronize a full thread block or a full warp, we recommend using __syncthreads() or __syncwarp(mask), respectively, for performance reasons.

7.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset

A cuda::barrier counts down from the expected arrival count to zero as participating threads call bar.arrive(). When the countdown reaches zero, a cuda::barrier is complete for the current phase. When the last call to bar.arrive() causes the countdown to reach zero, the countdown is automatically and atomically reset. The reset assigns the countdown to the expected arrival count, and moves the cuda::barrier to the next phase.

A token object of class cuda::barrier::arrival_token, as returned from token=bar.arrive(), is associated with the current phase of the barrier. A call to bar.wait(std::move(token)) blocks the calling thread while the cuda::barrier is in the current phase, i.e., while the phase associated with the token matches the phase of the cuda::barrier. If the phase is advanced (because the countdown reaches zero) before the call to bar.wait(std::move(token)) then the thread does not block; if the phase is advanced while the thread is blocked in bar.wait(std::move(token)), the thread is unblocked.

It is essential to know when a reset could or could not occur, especially in non-trivial arrive/wait synchronization patterns.

  • A thread’s calls to token=bar.arrive() and bar.wait(std::move(token)) must be sequenced such that token=bar.arrive() occurs during the cuda::barrier’s current phase, and bar.wait(std::move(token)) occurs during the same or next phase.

  • A thread’s call to bar.arrive() must occur when the barrier’s counter is non-zero. After barrier initialization, if a thread’s call to bar.arrive() causes the countdown to reach zero then a call to bar.wait(std::move(token)) must happen before the barrier can be reused for a subsequent call to bar.arrive().

  • bar.wait() must only be called using a token object of the current phase or the immediately preceding phase. For any other values of the token object, the behavior is undefined.

For simple arrive/wait synchronization patterns, compliance with these usage rules is straightforward.

7.26.5. Spatial Partitioning (also known as Warp Specialization)

A thread block can be spatially partitioned such that warps are specialized to perform independent computations. Spatial partitioning is used in a producer/consumer pattern, where one subset of threads produces data that is concurrently consumed by the other (disjoint) subset of threads.

A producer/consumer spatial partitioning pattern requires two one-sided synchronizations to manage a data buffer between the producer and the consumer.

Producer                                  Consumer

wait for buffer to be ready to be filled  signal buffer is ready to be filled
produce data and fill the buffer          wait for buffer to be filled
signal buffer is filled                   consume data in filled buffer

Producer threads wait for consumer threads to signal that the buffer is ready to be filled; however, consumer threads do not wait for this signal. Consumer threads wait for producer threads to signal that the buffer is filled; however, producer threads do not wait for this signal. For full producer/consumer concurrency this pattern has (at least) double buffering where each buffer requires two cuda::barriers.

#include <cuda/barrier>
#include <cooperative_groups.h>

using barrier = cuda::barrier<cuda::thread_scope_block>;

__device__ void producer(barrier ready[], barrier filled[], float* buffer, float* in, int N, int buffer_len)
{
    for (int i = 0; i < (N/buffer_len); ++i) {
        ready[i%2].arrive_and_wait(); /* wait for buffer_(i%2) to be ready to be filled */
        /* produce, i.e., fill in, buffer_(i%2)  */
        barrier::arrival_token token = filled[i%2].arrive(); /* buffer_(i%2) is filled */
    }
}

__device__ void consumer(barrier ready[], barrier filled[], float* buffer, float* out, int N, int buffer_len)
{
    barrier::arrival_token token1 = ready[0].arrive(); /* buffer_0 is ready for initial fill */
    barrier::arrival_token token2 = ready[1].arrive(); /* buffer_1 is ready for initial fill */
    for (int i = 0; i < (N/buffer_len); ++i) {
        filled[i%2].arrive_and_wait(); /* wait for buffer_(i%2) to be filled */
        /* consume buffer_(i%2) */
        barrier::arrival_token token = ready[i%2].arrive(); /* buffer_(i%2) is ready to be re-filled */
    }
}

//N is the total number of float elements in arrays in and out
__global__ void producer_consumer_pattern(int N, int buffer_len, float* in, float* out) {

    // Shared memory buffer declared below is of size 2 * buffer_len
    // so that we can alternatively work between two buffers.
    // buffer_0 = buffer and buffer_1 = buffer + buffer_len
    extern __shared__ float buffer[];

    // bar[0] and bar[1] track if buffers buffer_0 and buffer_1 are ready to be filled,
    // while bar[2] and bar[3] track if buffers buffer_0 and buffer_1 are filled-in respectively
    __shared__ barrier bar[4];


    auto block = cooperative_groups::this_thread_block();
    if (block.thread_rank() < 4)
        init(bar + block.thread_rank(), block.size());
    block.sync();

    if (block.thread_rank() < warpSize)
        producer(bar, bar+2, buffer, in, N, buffer_len);
    else
        consumer(bar, bar+2, buffer, out, N, buffer_len);
}

In this example the first warp is specialized as the producer and the remaining warps are specialized as the consumer. All producer and consumer threads participate (call bar.arrive() or bar.arrive_and_wait()) in each of the four cuda::barriers, so the expected arrival counts are equal to block.size().

A producer thread waits for the consumer threads to signal that the shared memory buffer can be filled. To wait on a cuda::barrier, a producer thread must first arrive on the barrier, ready[i%2].arrive(), to obtain a token, and then wait on it with that token, ready[i%2].wait(token). For simplicity, ready[i%2].arrive_and_wait() combines these operations:

bar.arrive_and_wait();
/* is equivalent to */
bar.wait(bar.arrive());

Producer threads compute and fill the ready buffer; they then signal that the buffer is filled by arriving on the filled barrier, filled[i%2].arrive(). A producer thread does not wait at this point; instead it waits until the next iteration’s buffer (double buffering) is ready to be filled.

A consumer thread begins by signaling that both buffers are ready to be filled. A consumer thread does not wait at this point; instead it waits for this iteration’s buffer to be filled, filled[i%2].arrive_and_wait(). After the consumer threads consume the buffer they signal that the buffer is ready to be filled again, ready[i%2].arrive(), and then wait for the next iteration’s buffer to be filled.

7.26.6. Early Exit (Dropping out of Participation)

When a thread that is participating in a sequence of synchronizations must exit early from that sequence, that thread must explicitly drop out of participation before exiting. The remaining participating threads can proceed normally with subsequent cuda::barrier arrive and wait operations.

#include <cuda/barrier>
#include <cooperative_groups.h>

__device__ bool condition_check();

__global__ void early_exit_kernel(int N) {
    using barrier = cuda::barrier<cuda::thread_scope_block>;
    __shared__ barrier bar;
    auto block = cooperative_groups::this_thread_block();

    if (block.thread_rank() == 0)
        init(&bar, block.size());
    block.sync();

    for (int i = 0; i < N; ++i) {
        if (condition_check()) {
          bar.arrive_and_drop();
          return;
        }
        /* other threads can proceed normally */
        barrier::arrival_token token = bar.arrive();
        /* code between arrive and wait */
        bar.wait(std::move(token)); /* wait for all threads to arrive */
        /* code after wait */
    }
}

This operation arrives on the cuda::barrier to fulfill the participating thread’s obligation to arrive in the current phase, and then decrements the expected arrival count for the next phase so that this thread is no longer expected to arrive on the barrier.

7.26.7. Completion Function

The CompletionFunction of cuda::barrier<Scope, CompletionFunction> is executed once per phase, after the last thread arrives and before any thread is unblocked from the wait. Memory operations performed by the threads that arrived at the barrier during the phase are visible to the thread executing the CompletionFunction, and all memory operations performed within the CompletionFunction are visible to all threads waiting at the barrier once they are unblocked from the wait.

#include <cuda/barrier>
#include <cooperative_groups.h>
#include <functional>
#include <type_traits>
namespace cg = cooperative_groups;

__device__ int divergent_compute(int*, int);
__device__ int independent_computation(int*, int);

__global__ void psum(int* data, int n, int* acc) {
  auto block = cg::this_thread_block();

  constexpr int BlockSize = 128;
  __shared__ int smem[BlockSize];
  assert(BlockSize == block.size());
  assert(n % 128 == 0);

  auto completion_fn = [&] {
    int sum = 0;
    for (int i = 0; i < 128; ++i) sum += smem[i];
    *acc += sum;
  };

  // Barrier storage
  // Note: the barrier is not default-constructible because
  //       completion_fn is not default-constructible due
  //       to the capture.
  using completion_fn_t = decltype(completion_fn);
  using barrier_t = cuda::barrier<cuda::thread_scope_block,
                                  completion_fn_t>;
  __shared__ std::aligned_storage_t<sizeof(barrier_t),
                                    alignof(barrier_t)> bar_storage;

  // Initialize barrier:
  barrier_t* bar = (barrier_t*)&bar_storage;
  if (block.thread_rank() == 0) {
    assert(*acc == 0);
    assert(blockDim.y == 1 && blockDim.z == 1);
    new (bar) barrier_t{block.size(), completion_fn};
    // equivalent to: init(bar, block.size(), completion_fn);
  }
  block.sync();

  // Main loop
  for (int i = 0; i < n; i += block.size()) {
    smem[block.thread_rank()] = data[i + block.thread_rank()];
    auto t = bar->arrive();
    // We can do independent computation here
    bar->wait(std::move(t));
    // shared-memory is safe to re-use in the next iteration
    // since all threads are done with it, including the one
    // that did the reduction
  }
}

7.26.8. Memory Barrier Primitives Interface

Memory barrier primitives are C-like interfaces to cuda::barrier functionality. These primitives are available by including the <cuda_awbarrier_primitives.h> header.

7.26.8.1. Data Types

typedef /* implementation defined */ __mbarrier_t;
typedef /* implementation defined */ __mbarrier_token_t;

7.26.8.2. Memory Barrier Primitives API

uint32_t __mbarrier_maximum_count();
void __mbarrier_init(__mbarrier_t* bar, uint32_t expected_count);
  • bar must be a pointer to __shared__ memory.

  • expected_count <= __mbarrier_maximum_count()

  • Initialize the expected arrival count of *bar for the current and next phase to expected_count.

void __mbarrier_inval(__mbarrier_t* bar);
  • bar must be a pointer to the mbarrier object residing in shared memory.

  • Invalidation of *bar is required before the corresponding shared memory can be repurposed.

__mbarrier_token_t __mbarrier_arrive(__mbarrier_t* bar);
  • Initialization of *bar must happen before this call.

  • Pending count must not be zero.

  • Atomically decrement the pending count for the current phase of the barrier.

  • Return an arrival token associated with the barrier state immediately prior to the decrement.

__mbarrier_token_t __mbarrier_arrive_and_drop(__mbarrier_t* bar);
  • Initialization of *bar must happen before this call.

  • Pending count must not be zero.

  • Atomically decrement the pending count for the current phase and expected count for the next phase of the barrier.

  • Return an arrival token associated with the barrier state immediately prior to the decrement.

bool __mbarrier_test_wait(__mbarrier_t* bar, __mbarrier_token_t token);
  • token must be associated with the immediately preceding phase or the current phase of *bar.

  • Returns true if token is associated with the immediately preceding phase of *bar, otherwise returns false.
    如果 token*bar 的前一个阶段相关联,则返回 true ,否则返回 false

// Note: This API has been deprecated in CUDA 11.1
uint32_t __mbarrier_pending_count(__mbarrier_token_t token);
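
A minimal sketch showing how these primitives compose into a block-wide barrier. This is illustrative device code, not an excerpt from the guide; it assumes blockDim.x does not exceed __mbarrier_maximum_count(), and polls __mbarrier_test_wait() since the primitives interface has no blocking wait:

#include <cuda_awbarrier_primitives.h>

__global__ void mbarrier_example() {
    __shared__ __mbarrier_t bar;

    if (threadIdx.x == 0) {
        __mbarrier_init(&bar, blockDim.x);  // expected count = block size
    }
    __syncthreads();

    /* produce data in shared memory */

    __mbarrier_token_t token = __mbarrier_arrive(&bar);  // arrive on the current phase
    /* independent work between arrive and wait */
    while (!__mbarrier_test_wait(&bar, token)) {}        // poll until the phase completes

    /* consume data */

    __syncthreads();
    if (threadIdx.x == 0) {
        __mbarrier_inval(&bar);  // invalidate before the shared memory is repurposed
    }
}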

7.27. Asynchronous Data Copies

CUDA 11 introduces asynchronous data operations with the memcpy_async API, which allows device code to explicitly manage the asynchronous copying of data. The memcpy_async feature enables CUDA kernels to overlap computation with data movement.

7.27.1. memcpy_async API

The memcpy_async APIs are provided in the cuda/barrier, cuda/pipeline, and cooperative_groups/memcpy_async.h header files.

The cuda::memcpy_async APIs work with the cuda::barrier and cuda::pipeline synchronization primitives, while cooperative_groups::memcpy_async synchronizes using cooperative_groups::wait.

These APIs have very similar semantics: copy objects from src to dst as-if performed by another thread which, on completion of the copy, can be synchronized through cuda::pipeline, cuda::barrier, or cooperative_groups::wait.

The complete API documentation of the cuda::memcpy_async overloads for cuda::barrier and cuda::pipeline is provided in the libcudacxx API documentation along with some examples.

The API documentation of cooperative_groups::memcpy_async is provided in the Cooperative Groups section of the documentation.

The memcpy_async APIs that use cuda::barrier and cuda::pipeline require compute capability 7.0 or higher. On devices with compute capability 8.0 or higher, memcpy_async operations from global to shared memory can benefit from hardware acceleration.

7.27.2. Copy and Compute Pattern - Staging Data Through Shared Memory

CUDA applications often employ a copy and compute pattern that:

  • fetches data from global memory,

  • stores data to shared memory, and

  • performs computations on shared memory data, and potentially writes results back to global memory.

The following sections illustrate how this pattern can be expressed without and with the memcpy_async feature:

  • The section Without memcpy_async introduces an example that does not overlap computation with data movement and uses an intermediate register to copy data.

  • The section With memcpy_async improves the previous example by introducing the cooperative_groups::memcpy_async and the cuda::memcpy_async APIs to directly copy data from global to shared memory without using intermediate registers.

  • The section Asynchronous Data Copies using cuda::barrier shows memcpy_async with cooperative groups and a barrier.

  • The section Single-Stage Asynchronous Data Copies using cuda::pipeline shows memcpy_async with a single-stage pipeline.

  • The section Multi-Stage Asynchronous Data Copies using cuda::pipeline shows memcpy_async with a multi-stage pipeline.

7.27.3. Without memcpy_async

Without memcpy_async, the copy phase of the copy and compute pattern is expressed as shared[local_idx] = global[global_idx]. This global to shared memory copy is expanded to a read from global memory into a register, followed by a write to shared memory from the register.

When this pattern occurs within an iterative algorithm, each thread block needs to synchronize after the shared[local_idx] = global[global_idx] assignment, to ensure all writes to shared memory have completed before the compute phase can begin. The thread block also needs to synchronize again after the compute phase, to prevent overwriting shared memory before all threads have completed their computations. This pattern is illustrated in the following code snippet.

#include <cooperative_groups.h>
__device__ void compute(int* global_out, int const* shared_in) {
    // Computes using all values of current batch from shared memory.
    // Stores this thread's result back to global memory.
}

__global__ void without_memcpy_async(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
  auto grid = cooperative_groups::this_grid();
  auto block = cooperative_groups::this_thread_block();
  assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

  extern __shared__ int shared[]; // block.size() * sizeof(int) bytes

  size_t local_idx = block.thread_rank();

  for (size_t batch = 0; batch < batch_sz; ++batch) {
    // Compute the index of the current batch for this block in global memory:
    size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;
    size_t global_idx = block_batch_idx + threadIdx.x;
    shared[local_idx] = global_in[global_idx];

    block.sync(); // Wait for all copies to complete

    compute(global_out + block_batch_idx, shared); // Compute and write result to global memory

    block.sync(); // Wait for compute using shared memory to finish
  }
}

7.27.4. With memcpy_async

With memcpy_async, the assignment to shared memory from global memory

shared[local_idx] = global_in[global_idx];

is replaced with an asynchronous copy operation from cooperative groups

cooperative_groups::memcpy_async(group, shared, global_in + batch_idx, sizeof(int) * block.size());

The cooperative_groups::memcpy_async API copies sizeof(int) * block.size() bytes from global memory starting at global_in + batch_idx to the shared data. This operation happens as-if performed by another thread, which synchronizes with the current thread’s call to cooperative_groups::wait after the copy has completed. Until the copy operation completes, modifying the global data or reading or writing the shared data introduces a data race.

On devices with compute capability 8.0 or higher, memcpy_async transfers from global to shared memory can benefit from hardware acceleration, which avoids transferring the data through an intermediate register.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

__device__ void compute(int* global_out, int const* shared_in);

__global__ void with_memcpy_async(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
  auto grid = cooperative_groups::this_grid();
  auto block = cooperative_groups::this_thread_block();
  assert(size == batch_sz * grid.size()); // Exposition: input size fits batch_sz * grid_size

  extern __shared__ int shared[]; // block.size() * sizeof(int) bytes

  for (size_t batch = 0; batch < batch_sz; ++batch) {
    size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;
    // Whole thread-group cooperatively copies whole batch to shared memory:
    cooperative_groups::memcpy_async(block, shared, global_in + block_batch_idx, sizeof(int) * block.size());

    cooperative_groups::wait(block); // Joins all threads, waits for all copies to complete

    compute(global_out + block_batch_idx, shared);

    block.sync();
  }
}

7.27.5. Asynchronous Data Copies using cuda::barrier

The cuda::memcpy_async overload for cuda::barrier enables synchronizing asynchronous data transfers using a barrier. This overload executes the copy operation as-if performed by another thread bound to the barrier: it increments the expected count of the current phase on creation and decrements it on completion of the copy operation, so that the phase of the barrier only advances when all threads participating in the barrier have arrived and all memcpy_async operations bound to the current phase of the barrier have completed. The following example uses a block-wide barrier, where all block threads participate, and swaps the wait operation with a barrier arrive_and_wait, while providing the same functionality as the previous example:

#include <cooperative_groups.h>
#include <cuda/barrier>
__device__ void compute(int* global_out, int const* shared_in);

__global__ void with_barrier(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
  auto grid = cooperative_groups::this_grid();
  auto block = cooperative_groups::this_thread_block();
  assert(size == batch_sz * grid.size()); // Assume input size fits batch_sz * grid_size

  extern __shared__ int shared[]; // block.size() * sizeof(int) bytes

  // Create a synchronization object (C++20 barrier)
  __shared__ cuda::barrier<cuda::thread_scope::thread_scope_block> barrier;
  if (block.thread_rank() == 0) {
    init(&barrier, block.size()); // Friend function initializes barrier
  }
  block.sync();

  for (size_t batch = 0; batch < batch_sz; ++batch) {
    size_t block_batch_idx = block.group_index().x * block.size() + grid.size() * batch;
    cuda::memcpy_async(block, shared, global_in + block_batch_idx, sizeof(int) * block.size(), barrier);

    barrier.arrive_and_wait(); // Waits for all copies to complete

    compute(global_out + block_batch_idx, shared);

    block.sync();
  }
}

7.27.6. Performance Guidance for memcpy_async

For compute capability 8.x, the pipeline mechanism is shared among CUDA threads in the same CUDA warp. This sharing causes batches of memcpy_async to be entangled within a warp, which can impact performance under certain circumstances.

This section highlights the warp-entanglement effect on commit, wait, and arrive operations. Please refer to the Pipeline Interface and the Pipeline Primitives Interface for an overview of the individual operations.

7.27.6.1. Alignment

On devices with compute capability 8.0, the cp.async family of instructions allows copying data from global to shared memory asynchronously. These instructions support copying 4, 8, and 16 bytes at a time. If the size provided to memcpy_async is a multiple of 4, 8, or 16, and both pointers passed to memcpy_async are aligned to a 4, 8, or 16 alignment boundary, then memcpy_async can be implemented using exclusively asynchronous memory operations.

Additionally, to achieve best performance when using the memcpy_async API, an alignment of 128 bytes for both shared memory and global memory is required.

For pointers to values of types with an alignment requirement of 1 or 2, it is often not possible to prove that the pointers are always aligned to a higher alignment boundary. Determining whether the cp.async instructions can or cannot be used must be delayed until run-time. Performing such a runtime alignment check increases code-size and adds runtime overhead.

The cuda::aligned_size_t<size_t Align>(size_t size) shape can be used to supply a proof that both pointers passed to memcpy_async are aligned to an Align alignment boundary and that size is a multiple of Align, by passing it as an argument where the memcpy_async APIs expect a Shape:

cuda::memcpy_async(group, dst, src, cuda::aligned_size_t<16>(N * block.size()), pipeline);

If the proof is incorrect, the behavior is undefined.

7.27.6.2. Trivially copyable

On devices with compute capability 8.0, the cp.async family of instructions allows copying data from global to shared memory asynchronously. If the pointer types passed to memcpy_async do not point to TriviallyCopyable types, the copy constructor of each output element needs to be invoked, and these instructions cannot be used to accelerate memcpy_async.

7.27.6.3. Warp Entanglement - Commit

The sequence of memcpy_async batches is shared across the warp. The commit operation is coalesced such that the sequence is incremented once for all converged threads that invoke the commit operation. If the warp is fully converged, the sequence is incremented by one; if the warp is fully diverged, the sequence is incremented by 32.

  • Let PB be the warp-shared pipeline’s actual sequence of batches.

    PB = {BP0, BP1, BP2, …, BPL}

  • Let TB be a thread’s perceived sequence of batches, as if the sequence were only incremented by this thread’s invocation of the commit operation.

    TB = {BT0, BT1, BT2, …, BTL}

    The pipeline::producer_commit() return value is from the thread’s perceived batch sequence.
    pipeline::producer_commit() 返回值来自线程的感知批处理序列。

  • An index in a thread’s perceived sequence always aligns to an equal or larger index in the actual warp-shared sequence. The sequences are equal only when all commit operations are invoked from converged threads.
    线程感知序列中的索引始终与实际 warp-shared 序列中的相等或更大的索引对齐。仅当所有提交操作都是从收敛的线程调用时,这些序列才相等。

    BTn ≡ BPm where n <= m  BTn ≡ BPm,其中 n <= m

For example, when a warp is fully diverged:
例如,当一个 warp 完全分散时:

  • The warp-shared pipeline’s actual sequence would be: PB = {0, 1, 2, 3, ..., 31} (PL=31).
    warp-shared pipeline 的实际序列将是: PB = {0, 1, 2, 3, ..., 31} ( PL=31 )。

  • The perceived sequence for each thread of this warp would be:
    这个 warp 中每个线程的感知顺序将是:

    • Thread 0: TB = {0} (TL=0)
      线程 0: TB = {0} ( TL=0 )

    • Thread 1: TB = {0} (TL=0)
      线程 1: TB = {0} ( TL=0 )

    • Thread 31: TB = {0} (TL=0)
      线程 31: TB = {0} ( TL=0 )

7.27.6.4. Warp Entanglement - Wait
7.27.6.4. Warp Entanglement - 等待 

A CUDA thread invokes either pipeline_consumer_wait_prior<N>() or pipeline::consumer_wait() to wait for batches in the perceived sequence TB to complete. Note that pipeline::consumer_wait() is equivalent to pipeline_consumer_wait_prior<N>(), where N = PL.
CUDA 线程调用 pipeline_consumer_wait_prior<N>()pipeline::consumer_wait() 等待感知序列 TB 中的批次完成。请注意, pipeline::consumer_wait() 等同于 pipeline_consumer_wait_prior<N>() ,其中 N = PL

The pipeline_consumer_wait_prior<N>() function waits for batches in the actual sequence at least up to and including PL-N. Since TL <= PL, waiting for batch up to and including PL-N includes waiting for batch TL-N. Thus, when TL < PL, the thread will unintentionally wait for additional, more recent batches.
pipeline_consumer_wait_prior<N>() 函数至少等待实际序列中直到并包括 PL-N 的批次。由于 TL <= PL ,等待直到并包括 PL-N 的批次也包括等待批次 TL-N 。因此,当 TL < PL 时,线程会无意中等待额外的、更新的批次。

In the extreme fully-diverged warp example above, each thread could wait for all 32 batches.
在上面极端完全分歧的 warp 示例中,每个线程都可以等待所有 32 批次。

7.27.6.5. Warp Entanglement - Arrive-On

Warp-divergence affects the number of times an arrive_on(bar) operation updates the barrier. If the invoking warp is fully converged, then the barrier is updated once. If the invoking warp is fully diverged, then 32 individual updates are applied to the barrier.
Warp 分歧会影响 arrive_on(bar) 操作更新屏障的次数。如果调用的 Warp 完全收敛,则屏障会更新一次。如果调用的 Warp 完全分散,则会对屏障应用 32 次单独的更新。

7.27.6.6. Keep Commit and Arrive-On Operations Converged
7.27.6.6. 保持提交和到达操作收敛 

It is recommended that commit and arrive-on invocations are by converged threads:
建议使用收敛线程进行提交和到达调用:

  • to not over-wait, by keeping threads’ perceived sequence of batches aligned with the actual sequence, and
    为了不过度等待,通过保持线程的批次的感知顺序与实际顺序保持一致,以及

  • to minimize updates to the barrier object.
    最小化对屏障对象的更新。

When code preceding these operations diverges threads, the warp should be re-converged via __syncwarp before invoking commit or arrive-on operations.
当这些操作之前的代码使线程发生分歧时,应在调用提交或到达操作之前通过 __syncwarp 重新汇聚 warp。

7.28. Asynchronous Data Copies using cuda::pipeline
7.28. 使用 cuda::pipeline 进行异步数据复制 

CUDA provides the cuda::pipeline synchronization object to manage and overlap asynchronous data movement with computation.
CUDA 提供 cuda::pipeline 同步对象来管理和重叠异步数据移动和计算。

The API documentation for cuda::pipeline is provided in the libcudacxx API. A pipeline object is a double-ended N-stage queue with a head and a tail, and is used to process work in a first-in first-out (FIFO) order. The pipeline object has the following member functions to manage the stages of the pipeline.
cuda::pipeline 的 API 文档在 libcudacxx API 中提供。管道对象是一个具有头部和尾部的双端 N 阶段队列,用于按先进先出(FIFO)顺序处理工作。管道对象具有以下成员函数来管理管道的各个阶段。

Pipeline Class Member Function
管道类成员函数

Description 描述

producer_acquire

Acquires an available stage in the pipeline internal queue.
获取管道内部队列中的可用阶段。

producer_commit

Commits the asynchronous operations issued after the producer_acquire call on the currently acquired stage of the pipeline.
提交在当前获取的管道阶段上 producer_acquire 调用后发出的异步操作。

consumer_wait

Wait for completion of all asynchronous operations on the oldest stage of the pipeline.
等待管道最旧阶段上所有异步操作完成。

consumer_release

Release the oldest stage of the pipeline to the pipeline object for reuse. The released stage can be then acquired by the producer.
将管道的最老阶段释放到管道对象以便重复使用。然后生产者可以获取已释放的阶段。

7.28.1. Single-Stage Asynchronous Data Copies using cuda::pipeline
7.28.1. 使用 cuda::pipeline 进行单阶段异步数据复制 

In previous examples we showed how to use cooperative_groups and cuda::barrier to do asynchronous data transfers. In this section, we will use the cuda::pipeline API with a single stage to schedule asynchronous copies. Later, we will expand this example to show multi-staged overlapped compute and copy.
在之前的示例中,我们展示了如何使用 cooperative_groups 和 cuda::barrier 来执行异步数据传输。在本节中,我们将使用带单个阶段的 cuda::pipeline API 来调度异步拷贝。稍后我们将扩展这个示例,展示多阶段重叠的计算和拷贝。

#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>

__device__ void compute(int* global_out, int const* shared_in);
__global__ void with_single_stage(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Assume input size fits batch_sz * grid_size

    constexpr size_t stages_count = 1; // Pipeline with one stage
    // One batch must fit in shared memory:
    extern __shared__ int shared[];  // block.size() * sizeof(int) bytes

    // Allocate shared storage for a single stage cuda::pipeline:
    __shared__ cuda::pipeline_shared_state<
        cuda::thread_scope::thread_scope_block,
        stages_count
    > shared_state;
    auto pipeline = cuda::make_pipeline(block, &shared_state);

    // Each thread processes `batch_sz` elements.
    // Compute offset of the batch `batch` of this thread block in global memory:
    auto block_batch = [&](size_t batch) -> int {
      return block.group_index().x * block.size() + grid.size() * batch;
    };

    for (size_t batch = 0; batch < batch_sz; ++batch) {
        size_t global_idx = block_batch(batch);

        // Collectively acquire the pipeline head stage from all producer threads:
        pipeline.producer_acquire();

        // Submit async copies to the pipeline's head stage to be
        // computed in the next loop iteration
        cuda::memcpy_async(block, shared, global_in + global_idx, sizeof(int) * block.size(), pipeline);
        // Collectively commit (advance) the pipeline's head stage
        pipeline.producer_commit();

        // Collectively wait for the operations committed to the
        // previous `compute` stage to complete:
        pipeline.consumer_wait();

        // Computation overlapped with the memcpy_async of the "copy" stage:
        compute(global_out + global_idx, shared);

        // Collectively release the stage resources
        pipeline.consumer_release();
    }
}

7.28.2. Multi-Stage Asynchronous Data Copies using cuda::pipeline
7.28.2. 使用 cuda::pipeline 进行多阶段异步数据复制 

In the previous examples with cooperative_groups::wait and cuda::barrier, the kernel threads immediately wait for the data transfer to shared memory to complete. This avoids data transfers from global memory into registers, but does not hide the latency of the memcpy_async operation by overlapping computation.
在先前的示例中,使用 cooperative_groups::wait 和 cuda::barrier,内核线程立即等待数据传输到共享内存完成。这样可以避免从全局内存传输数据到寄存器,但不能通过重叠计算来隐藏 memcpy_async 操作的延迟。

For that we use the CUDA pipeline feature in the following example. It provides a mechanism for managing a sequence of memcpy_async batches, enabling CUDA kernels to overlap memory transfers with computation. The following example implements a two-stage pipeline that overlaps data-transfer with computation. It:
为此,我们在以下示例中使用 CUDA 管道功能。它提供了一种管理 memcpy_async 批次序列的机制,使 CUDA 内核能够将内存传输与计算重叠。以下示例实现了一个两阶段管道,可以将数据传输与计算重叠。它:

  • Initializes the pipeline shared state (more below)
    初始化管道共享状态(更多信息请参见下文)

  • Kickstarts the pipeline by scheduling a memcpy_async for the first batch.
    通过为第一批次安排 memcpy_async 来启动流水线。

  • Loops over all the batches: it schedules memcpy_async for the next batch, blocks all threads on the completion of the memcpy_async for the previous batch, and then overlaps the computation on the previous batch with the asynchronous copy of the memory for the next batch.
    循环遍历所有批次:为下一批次调度 memcpy_async ,在上一批次 memcpy_async 完成时阻塞所有线程,然后将上一批次的计算与下一批次内存的异步复制重叠。

  • Finally, it drains the pipeline by performing the computation on the last batch.
    最后,通过对最后一个批次执行计算来清空管道。

Note that, for interoperability with cuda::pipeline, cuda::memcpy_async from the cuda/pipeline header is used here.
请注意,为了与 cuda::pipeline 互操作,此处使用了来自 cuda/pipeline 头部的 cuda::memcpy_async

#include <cooperative_groups/memcpy_async.h>
#include <cuda/pipeline>

__device__ void compute(int* global_out, int const* shared_in);
__global__ void with_staging(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Assume input size fits batch_sz * grid_size

    constexpr size_t stages_count = 2; // Pipeline with two stages
    // Two batches must fit in shared memory:
    extern __shared__ int shared[];  // stages_count * block.size() * sizeof(int) bytes
    size_t shared_offset[stages_count] = { 0, block.size() }; // Offsets to each batch

    // Allocate shared storage for a two-stage cuda::pipeline:
    __shared__ cuda::pipeline_shared_state<
        cuda::thread_scope::thread_scope_block,
        stages_count
    > shared_state;
    auto pipeline = cuda::make_pipeline(block, &shared_state);

    // Each thread processes `batch_sz` elements.
    // Compute offset of the batch `batch` of this thread block in global memory:
    auto block_batch = [&](size_t batch) -> int {
      return block.group_index().x * block.size() + grid.size() * batch;
    };

    // Initialize first pipeline stage by submitting a `memcpy_async` to fetch a whole batch for the block:
    if (batch_sz == 0) return;
    pipeline.producer_acquire();
    cuda::memcpy_async(block, shared + shared_offset[0], global_in + block_batch(0), sizeof(int) * block.size(), pipeline);
    pipeline.producer_commit();

    // Pipelined copy/compute:
    for (size_t batch = 1; batch < batch_sz; ++batch) {
        // Stage indices for the compute and copy stages:
        size_t compute_stage_idx = (batch - 1) % 2;
        size_t copy_stage_idx = batch % 2;

        size_t global_idx = block_batch(batch);

        // Collectively acquire the pipeline head stage from all producer threads:
        pipeline.producer_acquire();

        // Submit async copies to the pipeline's head stage to be
        // computed in the next loop iteration
        cuda::memcpy_async(block, shared + shared_offset[copy_stage_idx], global_in + global_idx, sizeof(int) * block.size(), pipeline);
        // Collectively commit (advance) the pipeline's head stage
        pipeline.producer_commit();

        // Collectively wait for the operations committed to the
        // previous `compute` stage to complete:
        pipeline.consumer_wait();

        // Computation overlapped with the memcpy_async of the "copy" stage:
        compute(global_out + global_idx, shared + shared_offset[compute_stage_idx]);

        // Collectively release the stage resources
        pipeline.consumer_release();
    }

    // Compute the batch fetched by the last iteration
    pipeline.consumer_wait();
    compute(global_out + block_batch(batch_sz-1), shared + shared_offset[(batch_sz - 1) % 2]);
    pipeline.consumer_release();
}

A pipeline object is a double-ended queue with a head and a tail, and is used to process work in a first-in first-out (FIFO) order. Producer threads commit work to the pipeline’s head, while consumer threads pull work from the pipeline’s tail. In the example above, all threads are both producer and consumer threads. The threads first commit memcpy_async operations to fetch the next batch while they wait on the previous batch of memcpy_async operations to complete.
管道对象是一个具有头部和尾部的双端队列,用于按照先进先出(FIFO)顺序处理工作。生产者线程向管道的头部提交工作,而消费者线程从管道的尾部拉取工作。在上面的示例中,所有线程既是生产者又是消费者线程。线程首先提交 memcpy_async 操作以获取下一批次,同时等待上一批 memcpy_async 操作完成。

  • Committing work to a pipeline stage involves:
    将工作提交到流水线阶段涉及:

    • Collectively acquiring the pipeline head from a set of producer threads using pipeline.producer_acquire().
      从一组生产者线程中使用 pipeline.producer_acquire() 共同获取管道头。

    • Submitting memcpy_async operations to the pipeline head.
      向管道头提交 memcpy_async 操作。

    • Collectively committing (advancing) the pipeline head using pipeline.producer_commit().
      使用 pipeline.producer_commit() 集体提交(推进)管道头。

  • Using a previously committed stage involves:
    使用先前提交的阶段包括:

    • Collectively waiting for the stage to complete, e.g., using pipeline.consumer_wait() to wait on the tail (oldest) stage.
      集体等待阶段完成,例如,使用 pipeline.consumer_wait() 等待尾部(最旧)阶段。

    • Collectively releasing the stage using pipeline.consumer_release().
      使用 pipeline.consumer_release() 集体释放该阶段。

cuda::pipeline_shared_state<scope, count> encapsulates the finite resources that allow a pipeline to process up to count concurrent stages. If all resources are in use, pipeline.producer_acquire() blocks producer threads until the resources of the next pipeline stage are released by consumer threads.
cuda::pipeline_shared_state<scope, count> 封装了有限资源,允许管道处理最多 count 个并发阶段。如果所有资源都在使用中, pipeline.producer_acquire() 会阻塞生产者线程,直到下一个管道阶段的资源被消费者线程释放。

This example can be written in a more concise manner by merging the prolog and epilog of the loop with the loop itself as follows:
这个示例可以通过将循环的前言和尾声与循环本身合并为更简洁的方式来编写,如下所示:

template <size_t stages_count = 2 /* Pipeline with stages_count stages */>
__global__ void with_staging_unified(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    assert(size == batch_sz * grid.size()); // Assume input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // stages_count * block.size() * sizeof(int) bytes
    size_t shared_offset[stages_count];
    for (int s = 0; s < stages_count; ++s) shared_offset[s] = s * block.size();

    __shared__ cuda::pipeline_shared_state<
        cuda::thread_scope::thread_scope_block,
        stages_count
    > shared_state;
    auto pipeline = cuda::make_pipeline(block, &shared_state);

    auto block_batch = [&](size_t batch) -> int {
        return block.group_index().x * block.size() + grid.size() * batch;
    };

    // compute_batch: next batch to process
    // fetch_batch:  next batch to fetch from global memory
    for (size_t compute_batch = 0, fetch_batch = 0; compute_batch < batch_sz; ++compute_batch) {
        // The outer loop iterates over the computation of the batches
        for (; fetch_batch < batch_sz && fetch_batch < (compute_batch + stages_count); ++fetch_batch) {
            // This inner loop iterates over the memory transfers, making sure that the pipeline is always full
            pipeline.producer_acquire();
            size_t shared_idx = fetch_batch % stages_count;
            size_t batch_idx = fetch_batch;
            size_t block_batch_idx = block_batch(batch_idx);
            cuda::memcpy_async(block, shared + shared_offset[shared_idx], global_in + block_batch_idx, sizeof(int) * block.size(), pipeline);
            pipeline.producer_commit();
        }
        pipeline.consumer_wait();
        int shared_idx = compute_batch % stages_count;
        int batch_idx = compute_batch;
        compute(global_out + block_batch(batch_idx), shared + shared_offset[shared_idx]);
        pipeline.consumer_release();
    }
}

The pipeline<thread_scope_block> primitive used above is very flexible, and supports two features that our examples above are not using: any arbitrary subset of threads in the block can participate in the pipeline, and from the threads that participate, any subsets can be producers, consumers, or both. In the following example, threads with an “even” thread rank are producers, while other threads are consumers:
上面使用的 pipeline<thread_scope_block> 原语非常灵活,支持上述示例没有用到的两个特性:块中线程的任意子集都可以参与 pipeline ,并且在参与的线程中,任意子集都可以作为生产者、消费者或两者兼有。在以下示例中,线程等级为“偶数”的线程是生产者,而其他线程是消费者:

__device__ void compute(int* global_out, int shared_in);

template <size_t stages_count = 2>
__global__ void with_specialized_staging_unified(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();

    // In this example, threads with "even" thread rank are producers, while threads with "odd" thread rank are consumers:
    const cuda::pipeline_role thread_role
      = block.thread_rank() % 2 == 0? cuda::pipeline_role::producer : cuda::pipeline_role::consumer;

    // Each thread block only has half of its threads as producers:
    auto producer_threads = block.size() / 2;

    // Map adjacent even and odd threads to the same id:
    const int thread_idx = block.thread_rank() / 2;

    auto elements_per_batch = size / batch_sz;
    auto elements_per_batch_per_block = elements_per_batch / grid.group_dim().x;

    extern __shared__ int shared[]; // stages_count * elements_per_batch_per_block * sizeof(int) bytes
    size_t shared_offset[stages_count];
    for (int s = 0; s < stages_count; ++s) shared_offset[s] = s * elements_per_batch_per_block;

    __shared__ cuda::pipeline_shared_state<
        cuda::thread_scope::thread_scope_block,
        stages_count
    > shared_state;
    cuda::pipeline pipeline = cuda::make_pipeline(block, &shared_state, thread_role);

    // Each thread block processes `batch_sz` batches.
    // Compute offset of the batch `batch` of this thread block in global memory:
    auto block_batch = [&](size_t batch) -> int {
      return elements_per_batch * batch + elements_per_batch_per_block * blockIdx.x;
    };

    for (size_t compute_batch = 0, fetch_batch = 0; compute_batch < batch_sz; ++compute_batch) {
        // The outer loop iterates over the computation of the batches
        for (; fetch_batch < batch_sz && fetch_batch < (compute_batch + stages_count); ++fetch_batch) {
            // This inner loop iterates over the memory transfers, making sure that the pipeline is always full
            if (thread_role == cuda::pipeline_role::producer) {
                // Only the producer threads schedule asynchronous memcpys:
                pipeline.producer_acquire();
                size_t shared_idx = fetch_batch % stages_count;
                size_t batch_idx = fetch_batch;
                size_t global_batch_idx = block_batch(batch_idx) + thread_idx;
                size_t shared_batch_idx = shared_offset[shared_idx] + thread_idx;
                cuda::memcpy_async(shared + shared_batch_idx, global_in + global_batch_idx, sizeof(int), pipeline);
                pipeline.producer_commit();
            }
        }
        if (thread_role == cuda::pipeline_role::consumer) {
            // Only the consumer threads compute:
            pipeline.consumer_wait();
            size_t shared_idx = compute_batch % stages_count;
            size_t global_batch_idx = block_batch(compute_batch) + thread_idx;
            size_t shared_batch_idx = shared_offset[shared_idx] + thread_idx;
            compute(global_out + global_batch_idx, *(shared + shared_batch_idx));
            pipeline.consumer_release();
        }
    }
}

There are some optimizations that pipeline performs, for example, when all threads are both producers and consumers, but in general, the cost of supporting all these features cannot be fully eliminated. For example, pipeline stores and uses a set of barriers in shared memory for synchronization, which is not really necessary if all threads in the block participate in the pipeline.
有一些优化是 pipeline 执行的,例如,当所有线程既是生产者又是消费者时,但总的来说,支持所有这些特性的成本是无法完全消除的。例如, pipeline 在共享内存中存储和使用一组屏障进行同步,如果块中的所有线程都参与流水线,则这并不是真正必要的。

For the particular case in which all threads in the block participate in the pipeline, we can do better than pipeline<thread_scope_block> by using a pipeline<thread_scope_thread> combined with __syncthreads():
对于块中所有线程都参与 pipeline 的特殊情况,我们可以通过使用 pipeline<thread_scope_thread>__syncthreads() 结合来比 pipeline<thread_scope_block> 更好:

template<size_t stages_count>
__global__ void with_staging_scope_thread(int* global_out, int const* global_in, size_t size, size_t batch_sz) {
    auto grid = cooperative_groups::this_grid();
    auto block = cooperative_groups::this_thread_block();
    auto thread = cooperative_groups::this_thread();
    assert(size == batch_sz * grid.size()); // Assume input size fits batch_sz * grid_size

    extern __shared__ int shared[]; // stages_count * block.size() * sizeof(int) bytes
    size_t shared_offset[stages_count];
    for (int s = 0; s < stages_count; ++s) shared_offset[s] = s * block.size();

    // No pipeline::shared_state needed
    cuda::pipeline<cuda::thread_scope_thread> pipeline = cuda::make_pipeline();

    auto block_batch = [&](size_t batch) -> int {
        return block.group_index().x * block.size() + grid.size() * batch;
    };

    for (size_t compute_batch = 0, fetch_batch = 0; compute_batch < batch_sz; ++compute_batch) {
        for (; fetch_batch < batch_sz && fetch_batch < (compute_batch + stages_count); ++fetch_batch) {
            pipeline.producer_acquire();
            size_t shared_idx = fetch_batch % stages_count;
            size_t batch_idx = fetch_batch;
            // Each thread fetches its own data:
            size_t thread_batch_idx = block_batch(batch_idx) + threadIdx.x;
            // The copy is performed by a single `thread` and the size of the batch is now that of a single element:
            cuda::memcpy_async(thread, shared + shared_offset[shared_idx] + threadIdx.x, global_in + thread_batch_idx, sizeof(int), pipeline);
            pipeline.producer_commit();
        }
        pipeline.consumer_wait();
        block.sync(); // __syncthreads: All memcpy_async of all threads in the block for this stage have completed here
        int shared_idx = compute_batch % stages_count;
        int batch_idx = compute_batch;
        compute(global_out + block_batch(batch_idx), shared + shared_offset[shared_idx]);
        pipeline.consumer_release();
    }
}

If the compute operation only reads shared memory written to by other threads in the same warp as the current thread, __syncwarp() suffices.
如果 compute 操作只读取由当前线程所在 warp 中的其他线程写入的共享内存,则 __syncwarp() 就足够了。

7.28.3. Pipeline Interface
7.28.3. 流水线接口 

The complete API documentation for cuda::memcpy_async is provided in the libcudacxx API documentation along with some examples.
cuda::memcpy_async 的完整 API 文档已在 libcudacxx API 文档中提供,同时附带一些示例。

The pipeline interface requires
pipeline 接口需要

  • at least CUDA 11.0, 至少 CUDA 11.0。

  • at least ISO C++ 2011 compatibility, e.g., to be compiled with -std=c++11, and
    至少需要 ISO C++ 2011 兼容性,例如,可以使用 -std=c++11 进行编译,并

  • #include <cuda/pipeline>.

For a C-like interface, when compiling without ISO C++ 2011 compatibility, see Pipeline Primitives Interface.
对于类似 C 的接口,在不使用 ISO C++ 2011 兼容性编译时,请参阅 Pipeline Primitives Interface。

7.28.4. Pipeline Primitives Interface
7.28.4. 流水线基元接口 

Pipeline primitives are a C-like interface for memcpy_async functionality. The pipeline primitives interface is available by including the <cuda_pipeline.h> header. When compiling without ISO C++ 2011 compatibility, include the <cuda_pipeline_primitives.h> header.
管道基元是一种类似于 C 的接口,用于 memcpy_async 功能。通过包含 <cuda_pipeline.h> 头文件,可以使用管道基元接口。在不包含 ISO C++ 2011 兼容性的编译时,需要包含 <cuda_pipeline_primitives.h> 头文件。

7.28.4.1. memcpy_async Primitive
7.28.4.1. memcpy_async 原始 

void __pipeline_memcpy_async(void* __restrict__ dst_shared,
                             const void* __restrict__ src_global,
                             size_t size_and_align,
                             size_t zfill=0);
  • Request that the following operation be submitted for asynchronous evaluation:
    请求将以下操作提交以进行异步评估:

    size_t i = 0;
    for (; i < size_and_align - zfill; ++i) ((char*)dst_shared)[i] = ((char*)src_global)[i]; /* copy */
    for (; i < size_and_align; ++i) ((char*)dst_shared)[i] = 0; /* zero-fill */
    
  • Requirements: 要求:

    • dst_shared must be a pointer to the shared memory destination for the memcpy_async.
      dst_shared 必须是指向 memcpy_async 的共享内存目标的指针。

    • src_global must be a pointer to the global memory source for the memcpy_async.
      src_global 必须是 memcpy_async 的全局内存源指针。

    • size_and_align must be 4, 8, or 16.
      size_and_align 必须是 4、8 或 16。

    • zfill <= size_and_align.

    • size_and_align must be the alignment of dst_shared and src_global.
      size_and_align 必须与 dst_sharedsrc_global 对齐。

  • It is a race condition for any thread to modify the source memory or observe the destination memory prior to waiting for the memcpy_async operation to complete. Between submitting a memcpy_async operation and waiting for its completion, any of the following actions introduces a race condition:
    任何线程在等待 memcpy_async 操作完成之前修改源内存或观察目标内存都会产生竞争条件。在提交 memcpy_async 操作并等待其完成之间,以下任何操作都会引入竞争条件:

    • Loading from dst_shared. dst_shared 加载。

    • Storing to dst_shared or src_global.
      存储到 dst_sharedsrc_global

    • Applying an atomic update to dst_shared or src_global.
      dst_sharedsrc_global 应用原子更新。

7.28.4.2. Commit Primitive
7.28.4.2. 提交原语 

void __pipeline_commit();
  • Commit submitted memcpy_async to the pipeline as the current batch.
    将提交的 memcpy_async 作为当前批次提交到流水线。

7.28.4.3. Wait Primitive
7.28.4.3. 等待原语 

void __pipeline_wait_prior(size_t N);
  • Let {0, 1, 2, ..., L} be the sequence of indices associated with invocations of __pipeline_commit() by a given thread.
    {0, 1, 2, ..., L} 成为给定线程通过 __pipeline_commit() 调用关联的索引序列。

  • Wait for completion of batches at least up to and including L-N.
    等待批次至少完成到 L-N

7.28.4.4. Arrive On Barrier Primitive
7.28.4.4. 到达屏障原语 

void __pipeline_arrive_on(__mbarrier_t* bar);
  • bar points to a barrier in shared memory.
    bar 指向共享内存中的一个屏障。

  • Increments the barrier arrival count by one; when all memcpy_async operations sequenced before this call have completed, the arrival count is decremented by one, and hence the net effect on the arrival count is zero. It is the user’s responsibility to ensure that the increment of the arrival count does not exceed __mbarrier_maximum_count().
    将障碍到达计数增加一,当在此调用之前顺序执行的所有 memcpy_async 操作完成时,到达计数减少一,因此到达计数的净效果为零。用户有责任确保到达计数的增量不超过 __mbarrier_maximum_count()

7.29. Asynchronous Data Copies using Tensor Memory Access (TMA)
7.29. 使用张量内存访问(TMA)进行异步数据复制 

Many applications require movement of large amounts of data from and to global memory. Often, the data is laid out in global memory as a multi-dimensional array with non-sequential data access patterns. To reduce global memory usage, sub-tiles of such arrays are copied to shared memory before use in computations. The loading and storing involves address calculations that can be error-prone and repetitive. To offload these computations, Compute Capability 9.0 introduces Tensor Memory Access (TMA). The primary goal of TMA is to provide an efficient data transfer mechanism from global memory to shared memory for multi-dimensional arrays.
许多应用程序需要大量数据在全局内存之间移动。通常,数据在全局内存中以多维数组的形式布局,具有非顺序数据访问模式。为了减少全局内存使用量,在计算中使用之前,这些数组的子瓦片被复制到共享内存中。加载和存储涉及地址计算,可能出错且重复。为了卸载这些计算,Compute Capability 9.0 引入了张量内存访问(TMA)。TMA 的主要目标是为多维数组提供从全局内存到共享内存的高效数据传输机制。

Naming. Tensor memory access (TMA) is a broad term used to market the features described in this section. For the purpose of forward-compatibility and to reduce discrepancies with the PTX ISA, the text in this section refers to TMA operations as either bulk-asynchronous copies or bulk tensor asynchronous copies, depending on the specific type of copy used. The term “bulk” is used to contrast these operations with the asynchronous memory operations described in the previous sections.
命名。张量内存访问(TMA)是一个广泛的术语,用于推广本节中描述的功能。为了实现向前兼容性并减少与 PTX ISA 的差异,本节中的文本将 TMA 操作称为批量异步拷贝或批量张量异步拷贝,具体取决于所使用的拷贝类型。术语“批量”用于将这些操作与前几节中描述的异步内存操作进行对比。

Dimensions. TMA supports copying both one-dimensional and multi-dimensional arrays (up to 5-dimensional). The programming model for bulk-asynchronous copies of one-dimensional contiguous arrays is different from the programming model for bulk tensor asynchronous copies of multi-dimensional arrays. To perform a bulk tensor asynchronous copy of a multi-dimensional array, the hardware requires a tensor map. This object describes the layout of the multi-dimensional array in global and shared memory. A tensor map is created on the host using the cuTensorMapEncode API. The tensor map is transferred from host to device as a const kernel parameter annotated with __grid_constant__, and can be used on the device to copy a tile of data between shared and global memory. In contrast, performing a bulk-asynchronous copy of a contiguous one-dimensional array does not require a tensor map: it can be performed on-device with a pointer and size parameter.
维度。TMA 支持复制一维和多维数组(最多 5 维)。一维连续数组的批量异步复制的编程模型与多维数组的批量张量异步复制的编程模型不同。要执行多维数组的批量张量异步复制,硬件需要一个张量映射。该对象描述了多维数组在全局内存和共享内存中的布局。使用 cuTensorMapEncode API 在主机上创建张量映射。张量映射作为带 __grid_constant__ 注解的 const 内核参数从主机传输到设备,并可在设备上用于在共享内存和全局内存之间复制数据块。相比之下,执行连续一维数组的批量异步复制不需要张量映射:它可以在设备上使用指针和大小参数执行。

Source and destination. The source and destination addresses of bulk-asynchronous copy operations can be in shared or global memory. The operations can read data from global to shared memory, write data from shared to global memory, and also copy from shared memory to Distributed Shared Memory of another block in the same cluster. In addition, when in a cluster, a bulk-asynchronous operation can be specified as being multicast. In this case, data can be transferred from global memory to the shared memory of multiple blocks within the cluster. The multicast feature is optimized for target architecture sm_90a and may have significantly reduced performance on other targets. Hence, its use is advised only when targeting compute architecture sm_90a.
源和目的地。批量异步复制操作的源和目的地地址可以位于共享或全局内存中。这些操作可以从全局内存读取数据到共享内存,从共享内存写入数据到全局内存,还可以将数据从共享内存复制到同一集群中另一个块的分布式共享内存中。此外,在集群中时,可以将批量异步操作指定为多播。在这种情况下,数据可以从全局内存传输到集群内多个块的共享内存中。多播功能针对目标架构 sm_90a 进行了优化,在其他目标上性能可能显著降低。因此,建议仅在面向计算架构 sm_90a 时使用该功能。

Asynchronous. Data transfers using TMA are asynchronous. This allows the initiating thread to continue computing while the hardware asynchronously copies the data. Whether the data transfer occurs asynchronously in practice is up to the hardware implementation and may change in the future. There are several completion mechanisms that bulk-asynchronous operations can use to signal that they have completed. When the operation reads from global to shared memory, any thread in the block can wait for the data to be readable in shared memory by waiting on a Shared Memory Barrier. When the bulk-asynchronous operation writes data from shared memory to global or distributed shared memory, only the initiating thread can wait for the operation to have completed. This is accomplished using a bulk async-group based completion mechanism. A table describing the completion mechanisms can be found below and in the PTX ISA.
异步。使用 TMA 的数据传输是异步的。这使得发起线程可以在硬件异步复制数据的同时继续计算。数据传输是否在实践中异步进行取决于硬件实现,并且可能会在将来发生变化。有几种完成机制可以用于信号化批量异步操作已完成。当操作从全局内存读取到共享内存时,块中的任何线程都可以通过等待共享内存屏障上的数据可读来等待数据。当批量异步操作将数据从共享内存写入到全局或分布式共享内存时,只有发起线程可以等待操作完成。这是通过使用基于批量异步组的完成机制来实现的。下表描述了完成机制,可以在下面和 PTX ISA 中找到。

Table 6 Asynchronous copies with possible source and destination memory spaces and completion mechanisms. An empty cell indicates that a source-destination pair is not supported.
表 6 异步复制,及其可能的源和目的地内存空间与完成机制。空单元格表示不支持该源-目的地组合。 

| Direction 方向: Destination 目的地 | Direction 方向: Source 源 | Completion 完成机制: Asynchronous copy 异步复制 | Completion 完成机制: Bulk-asynchronous copy (TMA) 批量异步复制(TMA) |
|---|---|---|---|
| Global 全局 | Global 全局 |  |  |
| Global 全局 | Shared::cta |  | Bulk async-group 批量异步组 |
| Shared::cta | Global 全局 | Async-group, mbarrier | Mbarrier |
| Shared::cluster | Global 全局 |  | Mbarrier (multicast) Mbarrier(多播) |
| Shared::cta | Shared::cluster |  | Mbarrier |
| Shared::cta | Shared::cta |  |  |

7.29.1. Using TMA to transfer one-dimensional arrays
7.29.1. 使用 TMA 传输一维数组 

This section demonstrates how to write a simple kernel that read-modify-writes a one-dimensional array using TMA. It shows how to load and store data using bulk-asynchronous copies, and how to synchronize threads of execution with those copies.
本节演示了如何编写一个简单的内核,使用 TMA 读取-修改-写入一维数组。这展示了如何使用批量异步拷贝加载和存储数据,以及如何使用这些拷贝来同步执行线程。

The code of the kernel is included below. Some functionality requires inline PTX assembly that is currently made available through libcu++. The availability of these wrappers can be checked with the following code:
内核代码包含在下面。某些功能需要内联 PTX 汇编,目前通过 libcu++提供。可以使用以下代码检查这些包装器的可用性:

#if defined(__CUDA_MINIMUM_ARCH__) && __CUDA_MINIMUM_ARCH__ < 900
static_assert(false, "Device code is being compiled with older architectures that are incompatible with TMA.");
#endif // __CUDA_MINIMUM_ARCH__

The kernel goes through the following stages:
内核经历以下阶段:

  1. Initialize shared memory barrier.
    初始化共享内存屏障。

  2. Initiate bulk-asynchronous copy of a block of memory from global to shared memory.
    启动从全局到共享内存的批量异步内存块复制。

  3. Arrive and wait on the shared memory barrier.
    到达并等待共享内存屏障。

  4. Increment the shared memory buffer values.
    增加共享内存缓冲区的值。

  5. Wait for shared memory writes to be visible to the subsequent bulk-asynchronous copy, i.e., order the shared memory writes in the async proxy before the next step.
    等待共享内存写入对后续的批量异步复制可见,即在下一步之前在异步代理中对共享内存进行排序写入。

  6. Initiate bulk-asynchronous copy of the buffer in shared memory to global memory.
    启动将共享内存中的缓冲区异步复制到全局内存。

  7. Wait at end of kernel for bulk-asynchronous copy to have finished reading shared memory.
    在内核末尾等待批量异步复制完成读取共享内存。

#include <cuda/barrier>
#include <cuda/ptx>
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace ptx = cuda::ptx;

static constexpr size_t buf_len = 1024;
__global__ void add_one_kernel(int* data, size_t offset)
{
  // Shared memory buffer. The destination shared memory buffer of
  // a bulk operations should be 16 byte aligned.
  __shared__ alignas(16) int smem_data[buf_len];

  // 1. a) Initialize shared memory barrier with the number of threads participating in the barrier.
  //    b) Make initialized barrier visible in async proxy.
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;
  if (threadIdx.x == 0) { 
    init(&bar, blockDim.x);                      // a)
    ptx::fence_proxy_async(ptx::space_shared);   // b)
  }
  __syncthreads();

  // 2. Initiate TMA transfer to copy global to shared memory.
  if (threadIdx.x == 0) {
    // 3a. cuda::memcpy_async arrives on the barrier and communicates
    //     how many bytes are expected to come in (the transaction count)
    cuda::memcpy_async(
        smem_data, 
        data + offset, 
        cuda::aligned_size_t<16>(sizeof(smem_data)),
        bar
    );
  }
  // 3b. All threads arrive on the barrier
  barrier::arrival_token token = bar.arrive();
  
  // 3c. Wait for the data to have arrived.
  bar.wait(std::move(token));

  // 4. Increment the buffer values and write back to shared memory
  for (int i = threadIdx.x; i < buf_len; i += blockDim.x) {
    smem_data[i] += 1;
  }

  // 5. Wait for shared memory writes to be visible to TMA engine.
  ptx::fence_proxy_async(ptx::space_shared);
  __syncthreads();
  // After syncthreads, writes by all threads are visible to TMA engine.

  // 6. Initiate TMA transfer to copy shared memory to global memory
  if (threadIdx.x == 0) {
    ptx::cp_async_bulk(
        ptx::space_global,
        ptx::space_shared,
        data + offset, smem_data, sizeof(smem_data));
    // 7. Wait for TMA transfer to have finished reading shared memory.
    // Create a "bulk async-group" out of the previous bulk copy operation.
    ptx::cp_async_bulk_commit_group();
    // Wait for the group to have completed reading from shared memory.
    ptx::cp_async_bulk_wait_group_read(ptx::n32_t<0>());
  }
}

Barrier initialization. The barrier is initialized with the number of threads participating in the block. As a result, the barrier will flip only if all threads have arrived on this barrier. Shared memory barriers are described in more detail in Asynchronous Data Copies using cuda::barrier. To make the initialized barrier visible to subsequent bulk-asynchronous copies, the fence.proxy.async.shared::cta instruction is used. This instruction ensures that subsequent bulk-asynchronous copy operations operate on the initialized barrier.
屏障初始化。屏障使用参与块的线程数进行初始化。因此,只有当所有线程都到达此屏障时,屏障才会翻转。有关共享内存屏障的详细信息,请参阅使用 cuda::barrier 进行异步数据复制。为了使初始化的屏障对后续的批量异步复制可见,使用 fence.proxy.async.shared::cta 指令。此指令确保后续的批量异步复制操作在初始化的屏障上运行。

TMA read. The bulk-asynchronous copy instruction directs the hardware to copy a large chunk of data into shared memory, and to update the transaction count of the shared memory barrier after completing the read. In general, issuing as few bulk copies with as big a size as possible results in the best performance. Because the copy can be performed asynchronously by the hardware, it is not necessary to split the copy into smaller chunks.
TMA 读取。批量异步复制指令指示硬件将大块数据复制到共享内存,并在完成读取后更新共享内存屏障的事务计数。通常,尽可能少地发出具有尽可能大尺寸的批量复制会获得最佳性能。由于硬件可以异步执行复制,因此无需将复制分成较小的块。

The thread that initiates the bulk-asynchronous copy operation arrives at the barrier using mbarrier.expect_tx. This is automatically performed by cuda::memcpy_async. This tells the barrier that the thread has arrived and also how many bytes (tx / transactions) are expected to arrive. Only a single thread has to update the expected transaction count. If multiple threads update the transaction count, the expected transaction will be the sum of the updates. The barrier will only flip once all threads have arrived and all bytes have arrived. Once the barrier has flipped, the bytes are safe to read from shared memory, both by the threads as well as by subsequent bulk-asynchronous copies. More information about barrier transaction accounting can be found in the PTX ISA.
发起批量异步复制操作的线程使用 mbarrier.expect_tx 到达屏障。这是由 cuda::memcpy_async 自动执行的。这告诉屏障线程已到达,以及预计到达的字节数(tx / 交易)。只有一个线程必须更新预期的交易计数。如果多个线程更新交易计数,则预期交易将是更新的总和。屏障只有在所有线程到达并且所有字节到达后才会翻转。一旦屏障翻转,字节就可以安全地从共享内存中读取,既可以由线程读取,也可以由后续的批量异步复制读取。有关屏障交易会计的更多信息,请参阅 PTX ISA。

Barrier wait. Waiting for the barrier to flip is done using mbarrier.try_wait. It can either return true, indicating that the wait is over, or return false, which may mean that the wait timed out. The while loop waits for completion, and retries on time-out.
屏障等待。等待屏障翻转使用 mbarrier.try_wait 完成。它可以返回 true,表示等待结束,也可以返回 false,这可能意味着等待超时。while 循环等待完成,并在超时时重试。

SMEM write and sync. The increment of the buffer values reads and writes to shared memory. To make the writes visible to subsequent bulk-asynchronous copies, the fence.proxy.async.shared::cta instruction is used. This orders the writes to shared memory before subsequent reads from bulk-asynchronous copy operations, which read through the async proxy. So each thread first orders the writes to objects in shared memory in the async proxy via the fence.proxy.async.shared::cta, and these operations by all threads are ordered before the async operation performed in thread 0 using __syncthreads().
SMEM 写入和同步。缓冲区值的增量读取和写入到共享内存。为了使写入对后续的批量异步复制可见,使用 fence.proxy.async.shared::cta 指令。这将在后续从批量异步复制操作读取之前对共享内存的写入进行排序,后者通过异步代理进行读取。因此,每个线程首先通过 fence.proxy.async.shared::cta 对共享内存中对象的写入进行排序,然后在线程 0 中执行的异步操作之前,所有线程的这些操作都会通过 __syncthreads() 进行排序。

TMA write and sync. The write from shared to global memory is again initiated by a single thread. The completion of the write is not tracked by a shared memory barrier. Instead, a thread-local mechanism is used. Multiple writes can be batched into a so-called bulk async-group. Afterwards, the thread can wait for all operations in this group to have completed reading from shared memory (as in the code above) or to have completed writing to global memory, making the writes visible to the initiating thread. For more information, refer to the PTX ISA documentation of cp.async.bulk.wait_group. Note that the bulk-asynchronous and non-bulk asynchronous copy instructions have different async-groups: there exist both cp.async.wait_group and cp.async.bulk.wait_group instructions.
TMA 写入和同步。从共享内存到全局内存的写入再次由单个线程发起。写入的完成不是通过共享内存屏障跟踪的。相反,使用线程本地机制。多个写入可以批量处理成所谓的批量异步组。然后,线程可以等待此组中的所有操作已完成从共享内存读取(如上面的代码)或已完成写入全局内存,使写入对发起线程可见。有关更多信息,请参阅 cp.async.bulk.wait_group 的 PTX ISA 文档。请注意,批量异步和非批量异步复制指令具有不同的异步组:存在 cp.async.wait_groupcp.async.bulk.wait_group 指令。

The bulk-asynchronous instructions have specific alignment requirements on their source and destination addresses. More information can be found in the table below.
批量异步指令对其源和目的地址有特定的对齐要求。更多信息请参阅下表。

Table 7 Alignment requirements for one-dimensional bulk-asynchronous operations in Compute Capability 9.0.
表 7 计算能力 9.0 中一维批量异步操作的对齐要求。 

| Address / Size 地址/大小 | Alignment 对齐 |
| --- | --- |
| Global memory address 全局内存地址 | Must be 16 byte aligned. 必须是 16 字节对齐的。 |
| Shared memory address 共享内存地址 | Must be 16 byte aligned. 必须是 16 字节对齐的。 |
| Shared memory barrier address 共享内存屏障地址 | Must be 8 byte aligned (this is guaranteed by cuda::barrier). 必须是 8 字节对齐(这由 cuda::barrier 保证)。 |
| Size of transfer 传输大小 | Must be a multiple of 16 bytes. 必须是 16 字节的倍数。 |

7.29.2. Using TMA to transfer multi-dimensional arrays
7.29.2. 使用 TMA 传输多维数组 

The primary difference between the one-dimensional and multi-dimensional case is that a tensor map must be created on the host and passed to the CUDA kernel. This section describes how to create a tensor map using the CUDA driver API, how to pass it to device, and how to use it on device.
一维和多维情况之间的主要区别在于必须在主机上创建一个张量映射并将其传递给 CUDA 内核。本节描述了如何使用 CUDA 驱动程序 API 创建张量映射,如何将其传递到设备以及如何在设备上使用它。

Driver API. A tensor map is created using the cuTensorMapEncodeTiled driver API. This API can be accessed by linking to the driver directly (-lcuda) or by using the cudaGetDriverEntryPoint API. Below, we show how to get a pointer to the cuTensorMapEncodeTiled API. For more information, refer to Driver Entry Point Access.
驱动程序 API。使用 cuTensorMapEncodeTiled 驱动程序 API 创建张量映射。可以通过直接链接到驱动程序( -lcuda )或使用 cudaGetDriverEntryPoint API 访问此 API。下面,我们展示如何获取指向 cuTensorMapEncodeTiled API 的指针。有关更多信息,请参阅驱动程序入口点访问。

#include <cassert>        // assert
#include <cudaTypedefs.h> // PFN_cuTensorMapEncodeTiled, CUtensorMap

// CUDA_CHECK is assumed to be an error-checking macro defined elsewhere.

PFN_cuTensorMapEncodeTiled_v12000 get_cuTensorMapEncodeTiled() {
  // Get pointer to cuGetProcAddress
  cudaDriverEntryPointQueryResult driver_status;
  void* cuGetProcAddress_ptr = nullptr;
  CUDA_CHECK(cudaGetDriverEntryPoint("cuGetProcAddress", &cuGetProcAddress_ptr, cudaEnableDefault, &driver_status));
  assert(driver_status == cudaDriverEntryPointSuccess);
  PFN_cuGetProcAddress_v12000 cuGetProcAddress = reinterpret_cast<PFN_cuGetProcAddress_v12000>(cuGetProcAddress_ptr);

  // Use cuGetProcAddress to get a pointer to the CTK 12.0 version of cuTensorMapEncodeTiled
  CUdriverProcAddressQueryResult symbol_status;
  void* cuTensorMapEncodeTiled_ptr = nullptr;
  CUresult res = cuGetProcAddress("cuTensorMapEncodeTiled", &cuTensorMapEncodeTiled_ptr, 12000, CU_GET_PROC_ADDRESS_DEFAULT, &symbol_status);
  assert(res == CUDA_SUCCESS && symbol_status == CU_GET_PROC_ADDRESS_SUCCESS);

  return reinterpret_cast<PFN_cuTensorMapEncodeTiled_v12000>(cuTensorMapEncodeTiled_ptr);
}

Creation. Creating a tensor map requires many parameters. Among them are the base pointer to an array in global memory, the size of the array (in number of elements), the stride from one row to the next (in bytes), and the size of the shared memory buffer (in number of elements). The code below creates a tensor map to describe a two-dimensional row-major array of size GMEM_HEIGHT x GMEM_WIDTH. Note the order of the parameters: the fastest moving dimension comes first.
创建。创建张量映射需要许多参数。其中包括全局内存中数组的基指针、数组的大小(元素数量)、从一行到下一行的跨度(字节)、共享内存缓冲区的大小(元素数量)。下面的代码创建一个张量映射,用于描述大小为 GMEM_HEIGHT x GMEM_WIDTH 的二维行主数组。请注意参数的顺序:最快移动的维度首先出现。

  CUtensorMap tensor_map{};
  // rank is the number of dimensions of the array.
  constexpr uint32_t rank = 2;
  uint64_t size[rank] = {GMEM_WIDTH, GMEM_HEIGHT};
  // The stride is the number of bytes to traverse from the first element of one row to the next.
  // It must be a multiple of 16.
  uint64_t stride[rank - 1] = {GMEM_WIDTH * sizeof(int)};
  // The box_size is the size of the shared memory buffer that is used as the
  // destination of a TMA transfer.
  uint32_t box_size[rank] = {SMEM_WIDTH, SMEM_HEIGHT};
  // The distance between elements in units of sizeof(element). A stride of 2
  // can be used to load only the real component of a complex-valued tensor, for instance.
  uint32_t elem_stride[rank] = {1, 1};

  // Get a function pointer to the cuTensorMapEncodeTiled driver API.
  auto cuTensorMapEncodeTiled = get_cuTensorMapEncodeTiled();

  // Create the tensor descriptor.
  CUresult res = cuTensorMapEncodeTiled(
    &tensor_map,                // CUtensorMap *tensorMap,
    CUtensorMapDataType::CU_TENSOR_MAP_DATA_TYPE_INT32,
    rank,                       // cuuint32_t tensorRank,
    tensor_ptr,                 // void *globalAddress,
    size,                       // const cuuint64_t *globalDim,
    stride,                     // const cuuint64_t *globalStrides,
    box_size,                   // const cuuint32_t *boxDim,
    elem_stride,                // const cuuint32_t *elementStrides,
    // Interleave patterns can be used to accelerate loading of values that
    // are less than 4 bytes long.
    CUtensorMapInterleave::CU_TENSOR_MAP_INTERLEAVE_NONE,
    // Swizzling can be used to avoid shared memory bank conflicts.
    CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_NONE,
    // L2 Promotion can be used to widen the effect of a cache-policy to a wider
    // set of L2 cache lines.
    CUtensorMapL2promotion::CU_TENSOR_MAP_L2_PROMOTION_NONE,
    // Any element that is outside of bounds will be set to zero by the TMA transfer.
    CUtensorMapFloatOOBfill::CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
  );

Host-to-device transfer. Bulk tensor asynchronous operations require the tensor map to be in immutable memory. This can be achieved by using constant memory or by passing the tensor map as a const __grid_constant__ parameter to a kernel. When passing the tensor map as a parameter, some versions of the GCC C++ compiler issue the warning “the ABI for passing parameters with 64-byte alignment has changed in GCC 4.6”. This warning can be ignored.
主机到设备的传输。批量张量异步操作需要张量映射位于不可变内存中。可以通过使用 constant 内存或将张量映射作为 const __grid_constant__ 参数传递给内核来实现这一点。当将张量映射作为参数传递时,某些版本的 GCC C++编译器会发出警告“在 GCC 4.6 中传递具有 64 字节对齐的参数的 ABI 已更改”。可以忽略此警告。

__global__ void kernel(const __grid_constant__ CUtensorMap tensor_map)
{
  // Use tensor_map here.
}
int main() {
  CUtensorMap map;
  // [ ..Initialize map.. ]
  kernel<<<1, 1>>>(map);
}

As an alternative to the __grid_constant__ kernel parameter, a global constant variable can be used. An example is included below.
作为 __grid_constant__ 内核参数的替代方案,可以使用全局常量变量。以下是一个示例。

__constant__ CUtensorMap global_tensor_map;
__global__ void kernel()
{
  // Use global_tensor_map here.
}
int main() {
  CUtensorMap local_tensor_map;
  // [ ..Initialize map.. ]
  cudaMemcpyToSymbol(global_tensor_map, &local_tensor_map, sizeof(CUtensorMap));
  kernel<<<1, 1>>>();
}

The following example copies the tensor map to global device memory. Using a pointer to a tensor map in global device memory is undefined behavior and will lead to silent and difficult to track down bugs.
以下示例将张量映射复制到全局设备内存。在全局设备内存中使用张量映射的指针是未定义行为,会导致难以追踪的错误。

__device__ CUtensorMap global_tensor_map;
__global__ void kernel(CUtensorMap *tensor_map)
{
  // Do *not* use tensor_map here. Using a global memory pointer is
  // undefined behavior and can fail silently and unreliably.
}
int main() {
  CUtensorMap local_tensor_map;
  // [ ..Initialize map.. ]
  cudaMemcpyToSymbol(global_tensor_map, &local_tensor_map, sizeof(CUtensorMap));
  // Get a device pointer to the tensor map in global memory.
  void *global_tensor_map_ptr = nullptr;
  cudaGetSymbolAddress(&global_tensor_map_ptr, global_tensor_map);
  kernel<<<1, 1>>>(static_cast<CUtensorMap*>(global_tensor_map_ptr));
}

Use. The kernel below loads a 2D tile of size SMEM_HEIGHT x SMEM_WIDTH from a larger 2D array. The top-left corner of the tile is indicated by the indices x and y. The tile is loaded into shared memory, modified, and written back to global memory.
使用。下面的内核从较大的 2D 数组中加载大小为 SMEM_HEIGHT x SMEM_WIDTH 的 2D 瓦片。瓦片的左上角由索引 xy 指示。将瓦片加载到共享内存中,进行修改,然后写回全局内存。

#include <cuda.h>         // CUtensorMap
#include <cuda/barrier>
using barrier = cuda::barrier<cuda::thread_scope_block>;
namespace cde = cuda::device::experimental;

__global__ void kernel(const __grid_constant__ CUtensorMap tensor_map, int x, int y) {
  // The destination shared memory buffer of a bulk tensor operation should be
  // 128 byte aligned.
  __shared__ alignas(128) int smem_buffer[SMEM_HEIGHT][SMEM_WIDTH];

  // Initialize shared memory barrier with the number of threads participating in the barrier.
  #pragma nv_diag_suppress static_var_with_dynamic_init
  __shared__ barrier bar;

  if (threadIdx.x == 0) {
    // Initialize barrier. All `blockDim.x` threads in block participate.
    init(&bar, blockDim.x);
    // Make initialized barrier visible in async proxy.
    cde::fence_proxy_async_shared_cta();    
  }
  // Syncthreads so initialized barrier is visible to all threads.
  __syncthreads();

  barrier::arrival_token token;
  if (threadIdx.x == 0) {
    // Initiate bulk tensor copy.
    cde::cp_async_bulk_tensor_2d_global_to_shared(&smem_buffer, &tensor_map, x, y, bar);
    // Arrive on the barrier and tell how many bytes are expected to come in.
    token = cuda::device::barrier_arrive_tx(bar, 1, sizeof(smem_buffer));
  } else {
    // Other threads just arrive.
    token = bar.arrive();
  }
  // Wait for the data to have arrived.
  bar.wait(std::move(token));

  // Symbolically modify a value in shared memory.
  smem_buffer[0][threadIdx.x] += threadIdx.x;

  // Wait for shared memory writes to be visible to TMA engine.
  cde::fence_proxy_async_shared_cta();
  __syncthreads();
  // After syncthreads, writes by all threads are visible to TMA engine.

  // Initiate TMA transfer to copy shared memory to global memory
  if (threadIdx.x == 0) {
    cde::cp_async_bulk_tensor_2d_shared_to_global(&tensor_map, x, y, &smem_buffer);
    // Wait for TMA transfer to have finished reading shared memory.
    // Create a "bulk async-group" out of the previous bulk copy operation.
    cde::cp_async_bulk_commit_group();
    // Wait for the group to have completed reading from shared memory.
    cde::cp_async_bulk_wait_group_read<0>();
  }

  // Destroy barrier. This invalidates the memory region of the barrier. If
  // further computations were to take place in the kernel, this allows the
  // memory location of the shared memory barrier to be reused.
  if (threadIdx.x == 0) {
    (&bar)->~barrier();
  }
}

Negative indices and out of bounds. When part of the tile that is being read from global to shared memory is out of bounds, the shared memory that corresponds to the out of bounds area is zero-filled. The top-left corner indices of the tile may also be negative. When writing from shared to global memory, parts of the tile may be out of bounds, but the top left corner cannot have any negative indices.
负索引和越界。当从全局读取到共享内存的瓦片的一部分越界时,与越界区域对应的共享内存将被填充为零。瓦片的左上角索引也可能为负。当从共享内存写入到全局内存时,瓦片的部分可能越界,但左上角不能有任何负索引。

Size and stride. The size of a tensor is the number of elements along one dimension. All sizes must be greater than or equal to one. The stride is the number of bytes between elements of the same dimension. For instance, a 4 x 4 matrix of integers has sizes 4 and 4. Since it has 4 bytes per element, the strides are 4 and 16 bytes. Due to alignment requirements, a 4 x 3 row-major matrix of integers must have strides of 4 and 16 bytes as well. Each row is padded with 4 extra bytes to ensure that the start of the next row is aligned to 16 bytes. For more information regarding alignment, refer to the table Alignment requirements for multi-dimensional bulk tensor asynchronous copy operations in Compute Capability 9.0 below.
尺寸和步幅。张量的尺寸是沿一个维度的元素数量。所有尺寸必须大于或等于 1。步幅是同一维度元素之间的字节数。例如,一个 4 x 4 的整数矩阵尺寸为 4 和 4。由于每个元素占 4 字节,步幅分别为 4 和 16 字节。由于对齐要求,4 x 3 的行主序整数矩阵的步幅也必须为 4 和 16 字节。每行填充 4 个额外字节,以确保下一行的起始位置对齐到 16 字节。有关对齐的更多信息,请参阅下表《计算能力 9.0 中多维批量张量异步复制操作的对齐要求》。

Table 8 Alignment requirements for multi-dimensional bulk tensor asynchronous copy operations in Compute Capability 9.0.
表 8 在计算能力 9.0 中,多维批量张量异步复制操作的对齐要求。 

| Address / Size 地址/大小 | Alignment 对齐 |
| --- | --- |
| Global memory address 全局内存地址 | Must be 16 byte aligned. 必须是 16 字节对齐的。 |
| Global memory sizes 全局内存大小 | Must be greater than or equal to one. Does not have to be a multiple of 16 bytes. 必须大于或等于一。不必是 16 字节的倍数。 |
| Global memory strides 全局内存步幅 | Must be multiples of 16 bytes. 必须是 16 字节的倍数。 |
| Shared memory address 共享内存地址 | Must be 128 byte aligned. 必须是 128 字节对齐的。 |
| Shared memory barrier address 共享内存屏障地址 | Must be 8 byte aligned (this is guaranteed by cuda::barrier). 必须是 8 字节对齐(这由 cuda::barrier 保证)。 |
| Size of transfer 传输大小 | Must be a multiple of 16 bytes. 必须是 16 字节的倍数。 |

7.29.2.1. Multi-dimensional TMA PTX wrappers
7.29.2.1. 多维 TMA PTX 包装器 

Below, the PTX instructions are ordered by their use in the example code above.
下面,PTX 指令按照它们在上面示例代码中的使用顺序排列。

The cp.async.bulk.tensor instructions initiate a bulk tensor asynchronous copy between global and shared memory. The wrappers below read from global to shared memory and write from shared to global memory.
cp.async.bulk.tensor 指令启动全局内存和共享内存之间的批量张量异步复制。下面的包装器从全局内存读取到共享内存,并从共享内存写入到全局内存。

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_1d_global_to_shared(
    void *dest, const CUtensorMap *tensor_map , int c0, cuda::barrier<cuda::thread_scope_block> &bar
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_2d_global_to_shared(
    void *dest, const CUtensorMap *tensor_map , int c0, int c1, cuda::barrier<cuda::thread_scope_block> &bar
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_3d_global_to_shared(
    void *dest, const CUtensorMap *tensor_map, int c0, int c1, int c2, cuda::barrier<cuda::thread_scope_block> &bar
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_4d_global_to_shared(
    void *dest, const CUtensorMap *tensor_map , int c0, int c1, int c2, int c3, cuda::barrier<cuda::thread_scope_block> &bar
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_5d_global_to_shared(
    void *dest, const CUtensorMap *tensor_map , int c0, int c1, int c2, int c3, int c4, cuda::barrier<cuda::thread_scope_block> &bar
);
// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_1d_shared_to_global(
    const CUtensorMap *tensor_map, int c0, const void *src
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_2d_shared_to_global(
    const CUtensorMap *tensor_map, int c0, int c1, const void *src
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_3d_shared_to_global(
    const CUtensorMap *tensor_map, int c0, int c1, int c2, const void *src
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_4d_shared_to_global(
    const CUtensorMap *tensor_map, int c0, int c1, int c2, int c3, const void *src
);

// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor
inline __device__
void cuda::device::experimental::cp_async_bulk_tensor_5d_shared_to_global(
    const CUtensorMap *tensor_map, int c0, int c1, int c2, int c3, int c4, const void *src
);

7.30. Profiler Counter Function
7.30. 性能分析器计数器功能 

Each multiprocessor has a set of sixteen hardware counters that an application can increment with a single instruction by calling the __prof_trigger() function.
每个多处理器都有一组十六个硬件计数器,应用程序可以通过调用 __prof_trigger() 函数使用单个指令递增这些计数器。

void __prof_trigger(int counter);

increments by one per warp the per-multiprocessor hardware counter of index counter. Counters 8 to 15 are reserved and should not be used by applications.
每个 warp 的每个多处理器硬件计数器的索引 counter 递增 1。计数器 8 到 15 为保留,不应被应用程序使用。

The value of counters 0, 1, …, 7 can be obtained via nvprof by nvprof --events prof_trigger_0x where x is 0, 1, …, 7. All counters are reset before each kernel launch (note that when collecting counters, kernel launches are synchronous as mentioned in Concurrent Execution between Host and Device).
计数器 0、1、...、7 的值可以通过 nvprofnvprof --events prof_trigger_0x 获得,其中 x 为 0、1、...、7。在每次内核启动之前都会重置所有计数器(请注意,在收集计数器时,内核启动是同步的,如在主机和设备之间的并发执行中所述)。

7.31. Assertion 7.31. 断言 

Assertion is only supported by devices of compute capability 2.x and higher.
断言仅受支持的设备为计算能力为 2.x 及更高版本。

void assert(int expression);

stops the kernel execution if expression is equal to zero. If the program is run within a debugger, this triggers a breakpoint and the debugger can be used to inspect the current state of the device. Otherwise, each thread for which expression is equal to zero prints a message to stderr after synchronization with the host via cudaDeviceSynchronize(), cudaStreamSynchronize(), or cudaEventSynchronize(). The format of this message is as follows:
如果 expression 等于零,则停止内核执行。如果程序在调试器中运行,则会触发断点,调试器可以用于检查设备的当前状态。否则,对于 expression 等于零的每个线程,在通过 cudaDeviceSynchronize()cudaStreamSynchronize()cudaEventSynchronize() 与主机同步后,会向 stderr 打印一条消息。此消息的格式如下:

<filename>:<line number>:<function>:
block: [blockIdx.x,blockIdx.y,blockIdx.z],
thread: [threadIdx.x,threadIdx.y,threadIdx.z]
Assertion `<expression>` failed.

Any subsequent host-side synchronization calls made for the same device will return cudaErrorAssert. No more commands can be sent to this device until cudaDeviceReset() is called to reinitialize the device.
对于同一设备进行的任何后续主机端同步调用都将返回 cudaErrorAssert 。在调用 cudaDeviceReset() 重新初始化设备之前,无法向该设备发送更多命令。

If expression is different from zero, the kernel execution is unaffected.
如果 expression 不等于零,则内核执行不受影响。

For example, the following program from source file test.cu
例如,来自源文件 test.cu 的以下程序

#include <assert.h>

__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;

    // This will have no effect
    assert(is_one);

    // This will halt kernel execution
    assert(should_be_one);
}

int main(int argc, char* argv[])
{
    testAssert<<<1,1>>>();
    cudaDeviceSynchronize();

    return 0;
}

will output: 将输出:

test.cu:19: void testAssert(): block: [0,0,0], thread: [0,0,0] Assertion `should_be_one` failed.

Assertions are for debugging purposes. They can affect performance and it is therefore recommended to disable them in production code. They can be disabled at compile time by defining the NDEBUG preprocessor macro before including assert.h. Note that expression should not be an expression with side effects (something like (++i > 0), for example), otherwise disabling the assertion will affect the functionality of the code.
断言用于调试目的。它们可能会影响性能,因此建议在生产代码中禁用它们。可以通过在包含 assert.h 之前定义 NDEBUG 预处理宏来在编译时禁用它们。请注意, expression 不应该是具有副作用的表达式(例如 (++i > 0) 之类的),否则禁用断言将影响代码的功能。

7.32. Trap function
7.32. 陷阱函数 

A trap operation can be initiated by calling the __trap() function from any device thread.
可以通过从任何设备线程调用 __trap() 函数来启动陷阱操作。

void __trap();

The execution of the kernel is aborted and an interrupt is raised in the host program.
内核的执行被中止,并在主机程序中引发中断。

7.33. Breakpoint Function
7.33. 断点功能 

Execution of a kernel function can be suspended by calling the __brkpt() function from any device thread.
通过从任何设备线程调用 __brkpt() 函数,可以暂停内核函数的执行。

void __brkpt();

7.34. Formatted Output
7.34. 格式化输出 

Formatted output is only supported by devices of compute capability 2.x and higher.
格式化输出仅受支持于计算能力为 2.x 及更高的设备。

int printf(const char *format[, arg, ...]);

prints formatted output from a kernel to a host-side output stream.
将格式化输出从内核打印到主机端输出流。

The in-kernel printf() function behaves in a similar way to the standard C-library printf() function, and the user is referred to the host system’s manual pages for a complete description of printf() behavior. In essence, the string passed in as format is output to a stream on the host, with substitutions made from the argument list wherever a format specifier is encountered. Supported format specifiers are listed below.
内核中的 printf() 函数的行为方式类似于标准 C 库中的 printf() 函数,用户应参考主机系统的手册页面以获取 printf() 行为的完整描述。本质上,作为 format 传递的字符串将输出到主机上的流中,在遇到格式说明符时,将从参数列表中进行替换。支持的格式说明符如下所示。

The printf() command is executed as any other device-side function: per-thread, and in the context of the calling thread. From a multi-threaded kernel, this means that a straightforward call to printf() will be executed by every thread, using that thread’s data as specified. Multiple versions of the output string will then appear at the host stream, once for each thread which encountered the printf().
printf() 命令像其他设备端函数一样执行:每个线程一次,并在调用线程的上下文中执行。在多线程内核中,这意味着对 printf() 的简单调用将由每个线程执行,使用该线程指定的数据。然后,输出字符串的多个版本将出现在主机流中,每个遇到 printf() 的线程都会出现一次。

It is up to the programmer to limit the output to a single thread if only a single output string is desired (see Examples for an illustrative example).
由程序员来决定是否将输出限制为单个线程,如果只需要一个输出字符串(请参见示例以获取说明性示例)。

Unlike the C-standard printf(), which returns the number of characters printed, CUDA’s printf() returns the number of arguments parsed. If no arguments follow the format string, 0 is returned. If the format string is NULL, -1 is returned. If an internal error occurs, -2 is returned.
与返回打印字符数的 C 标准 printf() 不同,CUDA 的 printf() 返回解析的参数数。如果在格式字符串后没有参数,则返回 0。如果格式字符串为 NULL,则返回 -1。如果发生内部错误,则返回 -2。

7.34.1. Format Specifiers
7.34.1. 格式说明符 

As for standard printf(), format specifiers take the form: %[flags][width][.precision][size]type
对于标准 printf() ,格式说明符采用以下形式: %[flags][width][.precision][size]type

The following fields are supported (see widely-available documentation for a complete description of all behaviors):
支持以下字段(请参阅广泛可用的文档,了解所有行为的完整描述):

  • Flags: '#' ' ' '0' '+' '-' 标志: '#' ' ' '0' '+' '-'

  • Width: '*' '0-9' 宽度: '*' '0-9'

  • Precision: '0-9' 精度: '0-9'

  • Size: 'h' 'l' 'll' 大小: 'h' 'l' 'll'

  • Type: "%cdiouxXpeEfgGaAs" 类型: "%cdiouxXpeEfgGaAs"

Note that CUDA’s printf()will accept any combination of flag, width, precision, size and type, whether or not overall they form a valid format specifier. In other words, “%hd” will be accepted and printf will expect a double-precision variable in the corresponding location in the argument list.
请注意,CUDA 的 printf() 将接受任何组合的标志、宽度、精度、大小和类型,无论它们是否总体上形成有效的格式说明符。换句话说,“ %hd ”将被接受,并且 printf 将期望在参数列表中的相应位置有一个双精度变量。

7.34.2. Limitations

Final formatting of the printf() output takes place on the host system. This means that the format string must be understood by the host-system’s compiler and C library. Every effort has been made to ensure that the format specifiers supported by CUDA’s printf function form a universal subset from the most common host compilers, but exact behavior will be host-OS-dependent.

As described in Format Specifiers, printf() will accept all combinations of valid flags and types. This is because it cannot determine what will and will not be valid on the host system where the final output is formatted. The effect of this is that output may be undefined if the program emits a format string which contains invalid combinations.

The printf() command can accept at most 32 arguments in addition to the format string. Additional arguments beyond this will be ignored, and the format specifier output as-is.

Owing to the differing size of the long type on 64-bit Windows platforms (four bytes on 64-bit Windows, eight bytes on other 64-bit platforms), a kernel which is compiled on a non-Windows 64-bit machine but then run on a win64 machine will see corrupted output for all format strings which include “%ld”. It is recommended that the compilation platform matches the execution platform to ensure safety.
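One portable mitigation, beyond matching platforms, is to avoid long and “%ld” in formatted output altogether and route 64-bit values through a fixed-width type. The host-side sketch below is illustrative only (formatPortably is not a CUDA API):

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// int64_t + PRId64 has the same size and format string on every 64-bit
// platform, unlike long + "%ld", whose size is four bytes on 64-bit
// Windows and eight bytes on other 64-bit platforms.
std::string formatPortably(int64_t v) {
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%" PRId64, v);
    return std::string(buf);
}
```

In device code the same idea applies: cast the value to long long and print it with “%lld”, which is eight bytes on all 64-bit platforms.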

The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

  • Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),

  • Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),

  • Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),

  • Module loading/unloading via cuModuleLoad() or cuModuleUnload(),

  • Context destruction via cudaDeviceReset() or cuCtxDestroy(),

  • Prior to executing a stream callback added by cudaStreamAddCallback() or cuStreamAddCallback().

Note that the buffer is not flushed automatically when the program exits. The user must call cudaDeviceReset() or cuCtxDestroy() explicitly, as shown in the examples below.
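The circular behavior can be pictured with a small host-side simulation (plain C++; CircularLog is an illustrative stand-in, not a CUDA API): once the buffer is full, each new line overwrites the oldest one, so only the most recent output survives to the next flush.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Simulate a circular printf buffer that holds at most `capacity` lines:
// once full, each new line overwrites the oldest one.
class CircularLog {
public:
    explicit CircularLog(size_t capacity) : capacity_(capacity) {}
    void print(const std::string& line) {
        if (lines_.size() == capacity_)
            lines_.pop_front();        // oldest output is lost
        lines_.push_back(line);
    }
    std::deque<std::string> flush() {  // e.g. at a synchronization point
        std::deque<std::string> out;
        out.swap(lines_);
        return out;
    }
private:
    size_t capacity_;
    std::deque<std::string> lines_;
};
```

With a capacity of 3, printing five lines and then flushing yields only the last three; the first two are silently overwritten, mirroring the device buffer’s behavior.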

Internally printf() uses a shared data structure and so it is possible that calling printf() might change the order of execution of threads. In particular, a thread which calls printf() might take a longer execution path than one which does not call printf(), and that path length is dependent upon the parameters of the printf(). Note, however, that CUDA makes no guarantees of thread execution order except at explicit __syncthreads() barriers, so it is impossible to tell whether execution order has been modified by printf() or by other scheduling behavior in the hardware.

7.34.3. Associated Host-Side API

The following API functions get and set the size of the buffer used to transfer the printf() arguments and internal metadata to the host (default is 1 megabyte):

  • cudaDeviceGetLimit(size_t* size, cudaLimitPrintfFifoSize)

  • cudaDeviceSetLimit(cudaLimitPrintfFifoSize, size_t size)

7.34.4. Examples

The following code sample:

#include <stdio.h>

__global__ void helloCUDA(float f)
{
    printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);
    cudaDeviceSynchronize();
    return 0;
}

will output:

Hello thread 2, f=1.2345
Hello thread 1, f=1.2345
Hello thread 4, f=1.2345
Hello thread 0, f=1.2345
Hello thread 3, f=1.2345

Notice how each thread encounters the printf() command, so there are as many lines of output as there were threads launched in the grid. As expected, global values (i.e., float f) are common between all threads, and local values (i.e., threadIdx.x) are distinct per-thread.

The following code sample:

#include <stdio.h>

__global__ void helloCUDA(float f)
{
    if (threadIdx.x == 0)
        printf("Hello thread %d, f=%f\n", threadIdx.x, f) ;
}

int main()
{
    helloCUDA<<<1, 5>>>(1.2345f);
    cudaDeviceSynchronize();
    return 0;
}

will output:

Hello thread 0, f=1.2345

Self-evidently, the if() statement limits which threads will call printf, so that only a single line of output is seen.

7.35. Dynamic Global Memory Allocation and Operations

Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.

__host__ __device__ void* malloc(size_t size);
__device__ void *__nv_aligned_device_malloc(size_t size, size_t align);
__host__ __device__  void free(void* ptr);

allocate and free memory dynamically from a fixed-size heap in global memory.

__host__ __device__ void* memcpy(void* dest, const void* src, size_t size);

copy size bytes from the memory location pointed to by src to the memory location pointed to by dest.

__host__ __device__ void* memset(void* ptr, int value, size_t size);

set size bytes of the memory block pointed to by ptr to value (interpreted as an unsigned char).

The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.

The CUDA in-kernel __nv_aligned_device_malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory or NULL if insufficient memory exists to fulfill the requested size or alignment. The address of the allocated memory will be a multiple of align. align must be a non-zero power of 2.
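What the alignment guarantee means can be sketched in host code (a plain C++ illustration of an aligned allocator, not the actual device implementation): over-allocate, then round the address up to the next multiple of the power-of-2 align.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Host-side illustration of what an aligned allocator must guarantee:
// the returned address is a multiple of `align` (a non-zero power of 2).
void* aligned_device_malloc_sketch(size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);  // power of 2
    // Over-allocate so we can round up, leaving room to stash the raw
    // pointer just before the aligned block for later freeing.
    void* raw = std::malloc(size + align + sizeof(void*));
    if (raw == nullptr) return nullptr;
    uintptr_t base = reinterpret_cast<uintptr_t>(raw) + sizeof(void*);
    uintptr_t aligned = (base + align - 1) & ~(uintptr_t)(align - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw;  // stash raw pointer
    return reinterpret_cast<void*>(aligned);
}

void aligned_device_free_sketch(void* ptr) {
    if (ptr != nullptr)
        std::free(reinterpret_cast<void**>(ptr)[-1]);
}
```

The rounding expression (base + align - 1) & ~(align - 1) is the standard trick for advancing an address to the next multiple of a power of two.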

The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc() or __nv_aligned_device_malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr have undefined behavior.

The memory allocated by a given CUDA thread via malloc() or __nv_aligned_device_malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA threads even from subsequent kernel launches. Any CUDA thread may free memory allocated by another thread, but care should be taken to ensure that the same pointer is not freed more than once.

7.35.1. Heap Memory Allocation

The device memory heap has a fixed size that must be specified before any program using malloc(), __nv_aligned_device_malloc() or free() is loaded into the context. A default heap of eight megabytes is allocated if any program uses malloc() or __nv_aligned_device_malloc() without explicitly specifying the heap size.

The following API functions get and set the heap size:

  • cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)

  • cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)

The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.

The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA Runtime). If the memory allocation fails, the module load will generate a CUDA_ERROR_SHARED_OBJECT_INIT_FAILED error.

Heap size cannot be changed once a module load has occurred, and it does not resize dynamically according to need.

Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().

7.35.2. Interoperability with Host Memory API

Memory allocated via device malloc() or __nv_aligned_device_malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory).

Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().

In addition, memory allocated by a call to malloc() or __nv_aligned_device_malloc() in device code cannot be used in any runtime or driver API calls (e.g., cudaMemcpy(), cudaMemset()).

7.35.3. Examples

7.35.3.1. Per Thread Allocation

The following code sample:

#include <stdlib.h>
#include <stdio.h>

__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    if (ptr != NULL)   // malloc() returns NULL if the heap is exhausted
        memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}

int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}

will output:

Thread 0 got pointer: 00057020
Thread 1 got pointer: 0005708c
Thread 2 got pointer: 000570f8
Thread 3 got pointer: 00057164
Thread 4 got pointer: 000571d0

Notice how each thread encounters the malloc() and memset() commands and so receives and initializes its own allocation. (Exact pointer values will vary: these are illustrative.)

7.35.3.2. Per Thread Block Allocation

#include <stdlib.h>

__global__ void mallocTest()
{
    __shared__ int* data;

    // The first thread in the block does the allocation and then
    // shares the pointer with all other threads through shared memory,
    // so that access can easily be coalesced.
    // 64 ints (256 bytes) per thread are allocated.
    if (threadIdx.x == 0) {
        size_t size = blockDim.x * 64 * sizeof(int);
        data = (int*)malloc(size);
    }
    __syncthreads();

    // Check for failure
    if (data == NULL)
        return;

    // Threads index into the memory, ensuring coalescence
    int* ptr = data;
    for (int i = 0; i < 64; ++i)
        ptr[i * blockDim.x + threadIdx.x] = threadIdx.x;

    // Ensure all threads complete before freeing
    __syncthreads();

    // Only one thread may free the memory!
    if (threadIdx.x == 0)
        free(data);
}

int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<10, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}

7.35.3.3. Allocation Persisting Between Kernel Launches

#include <stdlib.h>
#include <stdio.h>

#define NUM_BLOCKS 20

__device__ int* dataptr[NUM_BLOCKS]; // Per-block pointer

__global__ void allocmem()
{
    // Only the first thread in the block does the allocation
    // since we want only one allocation per block.
    if (threadIdx.x == 0)
        dataptr[blockIdx.x] = (int*)malloc(blockDim.x * sizeof(int));
    __syncthreads();

    // Check for failure
    if (dataptr[blockIdx.x] == NULL)
        return;

    // Zero the data with all threads in parallel
    dataptr[blockIdx.x][threadIdx.x] = 0;
}

// Simple example: store thread ID into each element
__global__ void usemem()
{
    int* ptr = dataptr[blockIdx.x];
    if (ptr != NULL)
        ptr[threadIdx.x] += threadIdx.x;
}

// Print the content of the buffer before freeing it
__global__ void freemem()
{
    int* ptr = dataptr[blockIdx.x];
    if (ptr != NULL)
        printf("Block %d, Thread %d: final value = %d\n",
                      blockIdx.x, threadIdx.x, ptr[threadIdx.x]);

    // Only free from one thread!
    if (threadIdx.x == 0)
        free(ptr);
}

int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);

    // Allocate memory
    allocmem<<< NUM_BLOCKS, 10 >>>();

    // Use memory
    usemem<<< NUM_BLOCKS, 10 >>>();
    usemem<<< NUM_BLOCKS, 10 >>>();
    usemem<<< NUM_BLOCKS, 10 >>>();

    // Free memory
    freemem<<< NUM_BLOCKS, 10 >>>();

    cudaDeviceSynchronize();

    return 0;
}

7.36. Execution Configuration

Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimension of the grid and blocks that will be used to execute the function on the device, as well as the associated stream (see CUDA Runtime for a description of streams).

The execution configuration is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

  • Dg is of type dim3 (see dim3) and specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched;

  • Db is of type dim3 (see dim3) and specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block;

  • Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory; this dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;

  • S is of type cudaStream_t and specifies the associated stream; S is an optional argument which defaults to 0.

As an example, a function declared as

__global__ void Func(float* parameter);

must be called like this:

Func<<< Dg, Db, Ns >>>(parameter);

The arguments to the execution configuration are evaluated before the actual function arguments.

The function call will fail if Dg or Db are greater than the maximum sizes allowed for the device as specified in Compute Capabilities, or if Ns is greater than the maximum amount of shared memory available on the device, minus the amount of shared memory required for static allocation.
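In practice Dg is often derived from the problem size with a round-up division so that Dg.x * Db.x covers all N elements. A minimal host-side sketch (blocksFor is an illustrative helper, not a CUDA API):

```cpp
#include <cassert>

// Round-up division: the number of blocks needed so that
// blocks * threadsPerBlock >= n.
unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}
```

A kernel launched as Func<<<blocksFor(N, 256), 256>>>(...) must still guard its body with if (idx < N), since the last block may be only partially full.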

Compute capability 9.0 and above allows users to specify compile time thread block cluster dimensions, so that the kernel can use the cluster hierarchy in CUDA. Compile time cluster dimensions can be specified using __cluster_dims__([x, [y, [z]]]). The example below shows a compile time cluster size of 2 in the X dimension and 1 in the Y and Z dimensions.

__global__ void __cluster_dims__(2, 1, 1) Func(float* parameter);

Thread block cluster dimensions can also be specified at runtime, and a kernel with a cluster can be launched using the cudaLaunchKernelEx API. The API takes a configuration argument of type cudaLaunchConfig_t, the kernel function pointer, and the kernel arguments. Runtime kernel configuration is shown in the example below.

__global__ void Func(float* parameter);


// Kernel invocation with runtime cluster size
{
    cudaLaunchConfig_t config = {0};
    // The grid dimension is not affected by cluster launch, and is still enumerated
    // using number of blocks.
    // The grid dimension should be a multiple of cluster size.
    config.gridDim = Dg;
    config.blockDim = Db;
    config.dynamicSmemBytes = Ns;

    cudaLaunchAttribute attribute[1];
    attribute[0].id = cudaLaunchAttributeClusterDimension;
    attribute[0].val.clusterDim.x = 2; // Cluster size in X-dimension
    attribute[0].val.clusterDim.y = 1;
    attribute[0].val.clusterDim.z = 1;
    config.attrs = attribute;
    config.numAttrs = 1;

    float* parameter;
    cudaLaunchKernelEx(&config, Func, parameter);
}

7.37. Launch Bounds

As discussed in detail in Multiprocessor Level, the fewer registers a kernel uses, the more threads and thread blocks are likely to reside on a multiprocessor, which can improve performance.

Therefore, the compiler uses heuristics to minimize register usage while keeping register spilling (see Device Memory Accesses) and instruction count to a minimum. An application can optionally aid these heuristics by providing additional information to the compiler in the form of launch bounds that are specified using the __launch_bounds__() qualifier in the definition of a __global__ function:

__global__ void
__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor, maxBlocksPerCluster)
MyKernel(...)
{
    ...
}
  • maxThreadsPerBlock specifies the maximum number of threads per block with which the application will ever launch MyKernel(); it compiles to the .maxntid PTX directive.

  • minBlocksPerMultiprocessor is optional and specifies the desired minimum number of resident blocks per multiprocessor; it compiles to the .minnctapersm PTX directive.

  • maxBlocksPerCluster is optional and specifies the desired maximum number of thread blocks per cluster with which the application will ever launch MyKernel(); it compiles to the .maxclusterrank PTX directive.

If launch bounds are specified, the compiler first derives from them the upper limit L on the number of registers the kernel should use to ensure that minBlocksPerMultiprocessor blocks (or a single block if minBlocksPerMultiprocessor is not specified) of maxThreadsPerBlock threads can reside on the multiprocessor (see Hardware Multithreading for the relationship between the number of registers used by a kernel and the number of registers allocated per block). The compiler then optimizes register usage in the following way:

  • If the initial register usage is higher than L, the compiler reduces it further until it becomes less than or equal to L, usually at the expense of more local memory usage and/or a higher number of instructions;

  • If the initial register usage is lower than L:

    • If maxThreadsPerBlock is specified and minBlocksPerMultiprocessor is not, the compiler uses maxThreadsPerBlock to determine the register usage thresholds for the transitions between n and n+1 resident blocks (i.e., when using one less register makes room for an additional resident block as in the example of Multiprocessor Level) and then applies similar heuristics as when no launch bounds are specified;

    • If both minBlocksPerMultiprocessor and maxThreadsPerBlock are specified, the compiler may increase register usage as high as L to reduce the number of instructions and better hide single thread instruction latency.
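The ceiling L can be illustrated with simplified host-side arithmetic (illustrative only: real register-file sizes and allocation granularities come from Hardware Multithreading and differ per architecture):

```cpp
#include <cassert>

// Simplified sketch of the register ceiling L implied by launch bounds:
// with regsPerSM registers per multiprocessor, guaranteeing that
// minBlocks blocks of maxThreads threads are resident means each thread
// may use at most regsPerSM / (minBlocks * maxThreads) registers,
// rounded down to an assumed allocation granularity g.
unsigned int registerCeiling(unsigned int regsPerSM,
                             unsigned int maxThreads,
                             unsigned int minBlocks,
                             unsigned int granularity) {
    unsigned int perThread = regsPerSM / (minBlocks * maxThreads);
    return (perThread / granularity) * granularity;
}
```

For example, with 64K registers per SM, __launch_bounds__(256, 3) and a granularity of 8, the model yields a ceiling of 80 registers per thread; real hardware allocates registers per warp, so actual limits differ.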

A kernel will fail to launch if it is executed with more threads per block than its launch bound maxThreadsPerBlock.

A kernel will fail to launch if it is executed with more thread blocks per cluster than its launch bound maxBlocksPerCluster.

Per thread resources required by a CUDA kernel might limit the maximum block size in an unwanted way. In order to maintain forward compatibility to future hardware and toolkits and to ensure that at least one thread block can run on an SM, developers should include the single argument __launch_bounds__(maxThreadsPerBlock) which specifies the largest block size that the kernel will be launched with. Failure to do so could lead to “too many resources requested for launch” errors. Providing the two argument version of __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) can improve performance in some cases. The right value for minBlocksPerMultiprocessor should be determined using a detailed per kernel analysis.

Optimal launch bounds for a given kernel will usually differ across major architecture revisions. The sample code below shows how this is typically handled in device code using the __CUDA_ARCH__ macro introduced in Application Compatibility.

#define THREADS_PER_BLOCK          256
#if __CUDA_ARCH__ >= 200
    #define MY_KERNEL_MAX_THREADS  (2 * THREADS_PER_BLOCK)
    #define MY_KERNEL_MIN_BLOCKS   3
#else
    #define MY_KERNEL_MAX_THREADS  THREADS_PER_BLOCK
    #define MY_KERNEL_MIN_BLOCKS   2
#endif

// Device code
__global__ void
__launch_bounds__(MY_KERNEL_MAX_THREADS, MY_KERNEL_MIN_BLOCKS)
MyKernel(...)
{
    ...
}

In the common case where MyKernel is invoked with the maximum number of threads per block (specified as the first parameter of __launch_bounds__()), it is tempting to use MY_KERNEL_MAX_THREADS as the number of threads per block in the execution configuration:

// Host code
MyKernel<<<blocksPerGrid, MY_KERNEL_MAX_THREADS>>>(...);

This will not work, however, since __CUDA_ARCH__ is undefined in host code as mentioned in Application Compatibility, so MyKernel will launch with 256 threads per block even when __CUDA_ARCH__ is greater than or equal to 200. Instead the number of threads per block should be determined:

  • Either at compile time using a macro that does not depend on __CUDA_ARCH__, for example

    // Host code
    MyKernel<<<blocksPerGrid, THREADS_PER_BLOCK>>>(...);
    
  • Or at runtime based on the compute capability

    // Host code
    cudaGetDeviceProperties(&deviceProp, device);
    int threadsPerBlock =
              (deviceProp.major >= 2 ?
                        2 * THREADS_PER_BLOCK : THREADS_PER_BLOCK);
    MyKernel<<<blocksPerGrid, threadsPerBlock>>>(...);
    

Register usage is reported by the --ptxas-options=-v compiler option. The number of resident blocks can be derived from the occupancy reported by the CUDA profiler (see Device Memory Accesses for a definition of occupancy).

The __launch_bounds__() and __maxnreg__() qualifiers cannot be applied to the same kernel.

Register usage can also be controlled for all __global__ functions in a file using the maxrregcount compiler option. The value of maxrregcount is ignored for functions with launch bounds.

7.38. Maximum Number of Registers per Thread

To provide a mechanism for low-level performance tuning, CUDA C++ provides the __maxnreg__() function qualifier to pass performance tuning information to the backend optimizing compiler. The __maxnreg__() qualifier specifies the maximum number of registers to be allocated to a single thread in a thread block. In the definition of a __global__ function:

__global__ void
__maxnreg__(maxNumberRegistersPerThread)
MyKernel(...)
{
    ...
}
  • maxNumberRegistersPerThread specifies the maximum number of registers to be allocated to a single thread in a thread block of the kernel MyKernel(); it compiles to the .maxnreg PTX directive.

The __launch_bounds__() and __maxnreg__() qualifiers cannot be applied to the same kernel.

Register usage can also be controlled for all __global__ functions in a file using the maxrregcount compiler option. The value of maxrregcount is ignored for functions with the __maxnreg__ qualifier.

7.39. #pragma unroll

By default, the compiler unrolls small loops with a known trip count. The #pragma unroll directive however can be used to control unrolling of any given loop. It must be placed immediately before the loop and only applies to that loop. It is optionally followed by an integral constant expression (ICE). If the ICE is absent, the loop will be completely unrolled if its trip count is constant. If the ICE evaluates to 1, the compiler will not unroll the loop. The pragma will be ignored if the ICE evaluates to a non-positive integer or to an integer greater than the maximum value representable by the int data type.

Examples:

struct S1_t { static const int value = 4; };
template <int X, typename T2>
__device__ void foo(int *p1, int *p2) {

// no argument specified, loop will be completely unrolled
#pragma unroll
for (int i = 0; i < 12; ++i)
  p1[i] += p2[i]*2;

// unroll value = 8
#pragma unroll (X+1)
for (int i = 0; i < 12; ++i)
  p1[i] += p2[i]*4;

// unroll value = 1, loop unrolling disabled
#pragma unroll 1
for (int i = 0; i < 12; ++i)
  p1[i] += p2[i]*8;

// unroll value = 4
#pragma unroll (T2::value)
for (int i = 0; i < 12; ++i)
  p1[i] += p2[i]*16;
}

__global__ void bar(int *p1, int *p2) {
foo<7, S1_t>(p1, p2);
}
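For intuition, an unroll factor of 4 corresponds to the following hand-written transformation (a host-side C++ illustration of the kind of code the compiler generates, including the remainder loop it must emit when the trip count is not a multiple of the factor):

```cpp
#include <cassert>

// What "#pragma unroll 4" asks the compiler to do, written by hand:
// process four iterations per trip to reduce loop overhead.
void scale_unrolled(int* p1, const int* p2, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {       // unrolled body, factor 4
        p1[i]     += p2[i]     * 2;
        p1[i + 1] += p2[i + 1] * 2;
        p1[i + 2] += p2[i + 2] * 2;
        p1[i + 3] += p2[i + 3] * 2;
    }
    for (; i < n; ++i)                 // remainder iterations
        p1[i] += p2[i] * 2;
}
```

When the trip count is a compile-time constant and a multiple of the factor, as in the examples above, the remainder loop disappears entirely.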

7.40. SIMD Video Instructions

PTX ISA version 3.0 includes SIMD (Single Instruction, Multiple Data) video instructions which operate on pairs of 16-bit values and quads of 8-bit values. These are available on devices of compute capability 3.0.

The SIMD video instructions are:

  • vadd2, vadd4

  • vsub2, vsub4

  • vavrg2, vavrg4

  • vabsdiff2, vabsdiff4

  • vmin2, vmin4

  • vmax2, vmax4

  • vset2, vset4

PTX instructions, such as the SIMD video instructions, can be included in CUDA programs by way of the asm() assembler statement.

The basic syntax of an asm() statement is:

asm("template-string" : "constraint"(output) : "constraint"(input));

An example of using the vabsdiff4 PTX instruction is:

asm("vabsdiff4.u32.u32.u32.add" " %0, %1, %2, %3;": "=r" (result):"r" (A), "r" (B), "r" (C));

This uses the vabsdiff4 instruction to compute an integer quad byte SIMD sum of absolute differences. The absolute difference value is computed for each byte of the unsigned integers A and B in SIMD fashion. The optional accumulate operation (.add) is specified to sum these differences.
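For reference, the value produced by this vabsdiff4.u32.u32.u32.add example can be written out in portable scalar C++ (an illustration of the instruction’s semantics, not PTX):

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference for vabsdiff4.u32.u32.u32.add:
// the sum of absolute differences of the four bytes of a and b, plus c.
uint32_t vabsdiff4_add(uint32_t a, uint32_t b, uint32_t c) {
    uint32_t sum = c;
    for (int i = 0; i < 4; ++i) {
        uint8_t ba = (a >> (8 * i)) & 0xFF;  // i-th byte of a
        uint8_t bb = (b >> (8 * i)) & 0xFF;  // i-th byte of b
        sum += (ba > bb) ? (ba - bb) : (bb - ba);
    }
    return sum;
}
```

The hardware instruction performs the four byte subtractions in parallel; this scalar loop only defines the result it must produce.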

Refer to the document “Using Inline PTX Assembly in CUDA” for details on using the assembly statement in your code. Refer to the PTX ISA documentation (“Parallel Thread Execution ISA Version 3.0” for example) for details on the PTX instructions for the version of PTX that you are using.

7.41. Diagnostic Pragmas

The following pragmas may be used to control the error severity used when a given diagnostic message is issued.

#pragma nv_diag_suppress
#pragma nv_diag_warning
#pragma nv_diag_error
#pragma nv_diag_default
#pragma nv_diag_once

Uses of these pragmas have the following form:

#pragma nv_diag_xxx error_number, error_number ...

The affected diagnostic is specified using the error number shown in a warning message. Any diagnostic may be overridden to be an error, but only warnings may have their severity suppressed or be restored to a warning after being promoted to an error. The nv_diag_default pragma is used to return the severity of a diagnostic to the one that was in effect before any pragmas were issued (i.e., the normal severity of the message as modified by any command-line options). The following example suppresses the "declared but never referenced" warning on the declaration of foo:

#pragma nv_diag_suppress 177
void foo()
{
  int i=0;
}
#pragma nv_diag_default 177
void bar()
{
  int i=0;
}

The following pragmas may be used to save and restore the current diagnostic pragma state:

#pragma nv_diagnostic push
#pragma nv_diagnostic pop

Examples:

#pragma nv_diagnostic push
#pragma nv_diag_suppress 177
void foo()
{
  int i=0;
}
#pragma nv_diagnostic pop
void bar()
{
  int i=0;
}

Note that the pragmas only affect the nvcc CUDA frontend compiler; they have no effect on the host compiler.

Removal Notice: Support for diagnostic pragmas without the nv_ prefix was removed in CUDA 12.0. If such pragmas appear inside device code, the warning unrecognized #pragma in device code will be emitted; otherwise they will be passed to the host compiler. If they are intended for CUDA code, use the pragmas with the nv_ prefix instead.

11. When the enclosing __host__ function is a template, nvcc may currently fail to issue a diagnostic message in some cases; this behavior may change in the future.

12. The intent is to prevent the host compiler from encountering the call to the function if the host compiler does not support it.

13. See the C++ Standard for the definition of integral constant expression.

8. Cooperative Groups

8.1. Introduction

Cooperative Groups is an extension to the CUDA programming model, introduced in CUDA 9, for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions.

Historically, the CUDA programming model has provided a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block, as implemented with the __syncthreads() intrinsic function. However, programmers would like to define and synchronize groups of threads at other granularities to enable greater performance, design flexibility, and software reuse in the form of “collective” group-wide function interfaces. In an effort to express broader patterns of parallel interaction, many performance-oriented programmers have resorted to writing their own ad hoc and unsafe primitives for synchronizing threads within a single warp, or across sets of thread blocks running on a single GPU. Whilst the performance improvements achieved have often been valuable, this has resulted in an ever-growing collection of brittle code that is expensive to write, tune, and maintain over time and across GPU generations. Cooperative Groups addresses this by providing a safe and future-proof mechanism to enable performant code.

8.2. What’s New in Cooperative Groups

8.2.1. CUDA 12.2

  • barrier_arrive and barrier_wait member functions were added for grid_group and thread_block. Description of the API is available here.

8.2.2. CUDA 12.1

8.2.3. CUDA 12.0

  • The following experimental APIs are now moved to the main namespace:

    • asynchronous reduce and scan update added in CUDA 11.7

    • thread_block_tile larger than 32 added in CUDA 11.1

  • It is no longer required to provide memory using the block_tile_memory object in order to create these large tiles on Compute Capability 8.0 or higher.

8.3. Programming Model Concept

The Cooperative Groups programming model describes synchronization patterns both within and across CUDA thread blocks. It provides both the means for applications to define their own groups of threads, and the interfaces to synchronize them. It also provides new launch APIs that enforce certain restrictions and therefore can guarantee the synchronization will work. These primitives enable new patterns of cooperative parallelism within CUDA, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire Grid.

The Cooperative Groups programming model consists of the following elements:

  • Data types for representing groups of cooperating threads;

  • Operations to obtain implicit groups defined by the CUDA launch API (e.g., thread blocks);

  • Collectives for partitioning existing groups into new groups;

  • Collective Algorithms for data movement and manipulation (e.g. memcpy_async, reduce, scan);

  • An operation to synchronize all threads within the group;

  • Operations to inspect the group properties;

  • Collectives that expose low-level, group-specific, and often hardware-accelerated operations.

The main concept in Cooperative Groups is that of objects naming the set of threads that are part of it. This expression of groups as first-class program objects improves software composition, since collective functions can receive an explicit object representing the group of participating threads. This object also makes programmer intent explicit, which eliminates unsound architectural assumptions that result in brittle code and undesirable restrictions upon compiler optimizations, and improves compatibility with new GPU generations.

To write efficient code, it is best to use specialized groups (going generic loses a lot of compile-time optimizations), and to pass these group objects by reference to functions that intend to use the threads in some cooperative fashion.

Cooperative Groups requires CUDA 9.0 or later. To use Cooperative Groups, include the header file:

// Primary header is compatible with pre-C++11, collective algorithm headers require C++11
#include <cooperative_groups.h>
// Optionally include for memcpy_async() collective
#include <cooperative_groups/memcpy_async.h>
// Optionally include for reduce() collective
#include <cooperative_groups/reduce.h>
// Optionally include for inclusive_scan() and exclusive_scan() collectives
#include <cooperative_groups/scan.h>

and use the Cooperative Groups namespace:

using namespace cooperative_groups;
// Alternatively use an alias to avoid polluting the namespace with collective algorithms
namespace cg = cooperative_groups;

The code can be compiled in the normal way using nvcc; however, if you wish to use memcpy_async, reduce or scan functionality and your host compiler’s default dialect is not C++11 or higher, then you must add --std=c++11 to the command line.

8.3.1. Composition Example

To illustrate the concept of groups, this example attempts to perform a block-wide sum reduction. Previously, there were hidden constraints on the implementation when writing this code:

__device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}

__global__ void parallel_kernel(float *x) {
    // ...
    // Entire thread block must call sum
    sum(x, n);
}

All threads in the thread block must arrive at the __syncthreads() barrier; however, this constraint is hidden from the developer who might want to use sum(…). With Cooperative Groups, a better way of writing this would be:

__device__ int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync();
    return total;
}

__global__ void parallel_kernel(...) {
    // ...
    // Entire thread block must call sum
    thread_block tb = this_thread_block();
    sum(tb, x, n);
    // ...
}

8.4. Group Types

8.4.1. Implicit Groups

Implicit groups represent the launch configuration of the kernel. Regardless of how your kernel is written, it always has a set number of threads, blocks and block dimensions, a single grid and grid dimensions. In addition, if the multi-device cooperative launch API is used, it can have multiple grids (single grid per device). These groups provide a starting point for decomposition into finer grained groups which are typically HW accelerated and are more specialized for the problem the developer is solving.

Although you can create an implicit group anywhere in the code, it is dangerous to do so. Creating a handle for an implicit group is a collective operation—all threads in the group must participate. If the group was created in a conditional branch that not all threads reach, this can lead to deadlocks or data corruption. For this reason, it is recommended that you create a handle for the implicit group upfront (as early as possible, before any branching has occurred) and use that handle throughout the kernel. Group handles must be initialized at declaration time (there is no default constructor) for the same reason and copy-constructing them is discouraged.

8.4.1.1. Thread Block Group

Any CUDA programmer is already familiar with a certain group of threads: the thread block. The Cooperative Groups extension introduces a new datatype, thread_block, to explicitly represent this concept within the kernel.

class thread_block;

Constructed via:

thread_block g = this_thread_block();

Public Member Functions:

static void sync(): Synchronize the threads named in the group, equivalent to g.barrier_wait(g.barrier_arrive())

thread_block::arrival_token barrier_arrive(): Arrive on the thread_block barrier, returns a token that needs to be passed into barrier_wait(). More details here

void barrier_wait(thread_block::arrival_token&& t): Wait on the thread_block barrier, takes the arrival token returned from barrier_arrive() as an rvalue reference. More details here

static unsigned int thread_rank(): Rank of the calling thread within [0, num_threads)

static dim3 group_index(): 3-Dimensional index of the block within the launched grid

static dim3 thread_index(): 3-Dimensional index of the thread within the launched block

static dim3 dim_threads(): Dimensions of the launched block in units of threads

static unsigned int num_threads(): Total number of threads in the group

Legacy member functions (aliases):

static unsigned int size(): Total number of threads in the group (alias of num_threads())

static dim3 group_dim(): Dimensions of the launched block (alias of dim_threads())

Example:

/// Loading an integer from global into shared memory
__global__ void kernel(int *globalInput) {
    __shared__ int x;
    thread_block g = this_thread_block();
    // Choose a leader in the thread block
    if (g.thread_rank() == 0) {
        // load from global into shared for all threads to work with
        x = (*globalInput);
    }
    // After loading data into shared memory, you want to synchronize
    // if all threads in your thread block need to see it
    g.sync(); // equivalent to __syncthreads();
}

Note that all threads in the group must participate in collective operations, or the behavior is undefined.

Related: The thread_block datatype is derived from the more generic thread_group datatype, which can be used to represent a wider class of groups.
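For instance, a helper written against the generic thread_group type can accept a thread_block, a tile, or a coalesced group alike. A minimal sketch (the helper name is illustrative; recall the earlier advice that specialized group types retain more compile-time optimization):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: a group-agnostic helper. Every thread in the group strides over
// the buffer, then the whole group synchronizes.
__device__ void zero_buffer(const cg::thread_group& g, int *buf, int n) {
    for (int i = g.thread_rank(); i < n; i += g.num_threads())
        buf[i] = 0;
    g.sync();  // collective: all threads in g must participate
}

__global__ void kernel() {
    __shared__ int smem[256];
    cg::thread_block block = cg::this_thread_block();
    zero_buffer(block, smem, 256);  // the whole block acts as one group
}
```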

8.4.1.2. Cluster Group

This group object represents all the threads launched in a single cluster. Refer to Thread Block Clusters. The APIs are available on all hardware with Compute Capability 9.0 or higher; on such hardware, when a non-cluster grid is launched, the APIs assume a 1x1x1 cluster.

class cluster_group;

Constructed via:

cluster_group g = this_cluster();

Public Member Functions:

static void sync(): Synchronize the threads named in the group, equivalent to g.barrier_wait(g.barrier_arrive())

static cluster_group::arrival_token barrier_arrive(): Arrive on the cluster barrier, returns a token that needs to be passed into barrier_wait(). More details here

static void barrier_wait(cluster_group::arrival_token&& t): Wait on the cluster barrier, takes the arrival token returned from barrier_arrive() as an rvalue reference. More details here

static unsigned int thread_rank(): Rank of the calling thread within [0, num_threads)

static unsigned int block_rank(): Rank of the calling block within [0, num_blocks)

static unsigned int num_threads(): Total number of threads in the group

static unsigned int num_blocks(): Total number of blocks in the group

static dim3 dim_threads(): Dimensions of the launched cluster in units of threads

static dim3 dim_blocks(): Dimensions of the launched cluster in units of blocks

static dim3 block_index(): 3-Dimensional index of the calling block within the launched cluster

static unsigned int query_shared_rank(const void *addr): Obtain the block rank to which a shared memory address belongs

static T* map_shared_rank(T *addr, int rank): Obtain the address of a shared memory variable of another block in the cluster

Legacy member functions (aliases):

static unsigned int size(): Total number of threads in the group (alias of num_threads())
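As a hedged sketch of these APIs (Compute Capability 9.0 or higher; the kernel name and cluster size are illustrative), a block can read from a neighboring block’s shared memory within its cluster:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: each block publishes a value in shared memory, then reads the
// value of the next block in the cluster via map_shared_rank().
__global__ void __cluster_dims__(2, 1, 1) neighbor_kernel(int *out) {
    __shared__ int x;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0)
        x = (int)cluster.block_rank();
    cluster.sync();  // make every block's x visible cluster-wide

    unsigned int peer = (cluster.block_rank() + 1) % cluster.num_blocks();
    int *peer_x = cluster.map_shared_rank(&x, peer);
    if (threadIdx.x == 0)
        out[cluster.block_rank()] = *peer_x;

    cluster.sync();  // keep shared memory valid until all remote reads finish
}
```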

8.4.1.3. Grid Group

This group object represents all the threads launched in a single grid. APIs other than sync() are available at all times, but to be able to synchronize across the grid, you need to use the cooperative launch API.

class grid_group;

Constructed via:

grid_group g = this_grid();

Public Member Functions:

bool is_valid() const: Returns whether the grid_group can synchronize

void sync() const: Synchronize the threads named in the group, equivalent to g.barrier_wait(g.barrier_arrive())

grid_group::arrival_token barrier_arrive(): Arrive on the grid barrier, returns a token that needs to be passed into barrier_wait(). More details here

void barrier_wait(grid_group::arrival_token&& t): Wait on the grid barrier, takes the arrival token returned from barrier_arrive() as an rvalue reference. More details here

static unsigned long long thread_rank(): Rank of the calling thread within [0, num_threads)

static unsigned long long block_rank(): Rank of the calling block within [0, num_blocks)

static unsigned long long cluster_rank(): Rank of the calling cluster within [0, num_clusters)

static unsigned long long num_threads(): Total number of threads in the group

static unsigned long long num_blocks(): Total number of blocks in the group

static unsigned long long num_clusters(): Total number of clusters in the group

static dim3 dim_blocks(): Dimensions of the launched grid in units of blocks

static dim3 dim_clusters(): Dimensions of the launched grid in units of clusters

static dim3 block_index(): 3-Dimensional index of the block within the launched grid

static dim3 cluster_index(): 3-Dimensional index of the cluster within the launched grid

Legacy member functions (aliases):

static unsigned long long size(): Total number of threads in the group (alias of num_threads())

static dim3 group_dim(): Dimensions of the launched grid (alias of dim_blocks())
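A minimal sketch of grid-wide synchronization (kernel and variable names are illustrative; the kernel must be launched with the cooperative launch API, e.g. cudaLaunchCooperativeKernel, for sync() to be valid):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: two phases separated by a grid-wide barrier, so phase 2 may
// safely read phase-1 results written by threads in other blocks.
__global__ void two_phase(const float *in, float *tmp, float *out, int n) {
    cg::grid_group grid = cg::this_grid();
    unsigned long long i = grid.thread_rank();

    if (i < (unsigned long long)n)
        tmp[i] = in[i] * 2.0f;                       // phase 1
    grid.sync();                                     // all threads in the grid arrive here
    if (i < (unsigned long long)n)
        out[i] = tmp[i] + ((i > 0) ? tmp[i - 1] : 0.0f);  // phase 2
}

// Host side (error handling omitted):
//   void *args[] = { &d_in, &d_tmp, &d_out, &n };
//   cudaLaunchCooperativeKernel((void *)two_phase, numBlocks, blockSize, args);
```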

8.4.1.4. Multi Grid Group

This group object represents all the threads launched across all devices of a multi-device cooperative launch. Unlike grid_group, all the APIs require that you have used the appropriate launch API.

class multi_grid_group;

Constructed via:

// Kernel must be launched with the cooperative multi-device API
multi_grid_group g = this_multi_grid();

Public Member Functions:

bool is_valid() const: Returns whether the multi_grid_group can be used

void sync() const: Synchronize the threads named in the group

unsigned long long num_threads() const: Total number of threads in the group

unsigned long long thread_rank() const: Rank of the calling thread within [0, num_threads)

unsigned int grid_rank() const: Rank of the grid within [0, num_grids)

unsigned int num_grids() const: Total number of grids launched

Legacy member functions (aliases):

unsigned long long size() const: Total number of threads in the group (alias of num_threads())

Deprecation Notice: multi_grid_group has been deprecated in CUDA 11.3 for all devices.

8.4.2. Explicit Groups

8.4.2.1. Thread Block Tile

A templated version of a tiled group, where a template parameter is used to specify the size of the tile; with this known at compile time, there is the potential for more optimal execution.

template <unsigned int Size, typename ParentT = void>
class thread_block_tile;

Constructed via:

template <unsigned int Size, typename ParentT>
_CG_QUALIFIER thread_block_tile<Size, ParentT> tiled_partition(const ParentT& g)

Size must be a power of 2 and less than or equal to 1024. The Notes section describes the extra steps needed to create tiles of size larger than 32 on hardware with Compute Capability 7.5 or lower.

ParentT is the parent type from which this group was partitioned. It is automatically inferred, but a value of void will store this information in the group handle rather than in the type.

Public Member Functions:

void sync() const: Synchronize the threads named in the group

unsigned long long num_threads() const: Total number of threads in the group

unsigned long long thread_rank() const: Rank of the calling thread within [0, num_threads)

unsigned long long meta_group_size() const: Returns the number of groups created when the parent group was partitioned.

unsigned long long meta_group_rank() const: Linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size)

T shfl(T var, unsigned int src_rank) const: Refer to Warp Shuffle Functions. Note: for sizes larger than 32, all threads in the group have to specify the same src_rank, otherwise the behavior is undefined.

T shfl_up(T var, int delta) const: Refer to Warp Shuffle Functions, available only for sizes lower than or equal to 32.

T shfl_down(T var, int delta) const: Refer to Warp Shuffle Functions, available only for sizes lower than or equal to 32.

T shfl_xor(T var, int delta) const: Refer to Warp Shuffle Functions, available only for sizes lower than or equal to 32.

T any(int predicate) const: Refer to Warp Vote Functions

T all(int predicate) const: Refer to Warp Vote Functions

T ballot(int predicate) const: Refer to Warp Vote Functions, available only for sizes lower than or equal to 32.

unsigned int match_any(T val) const: Refer to Warp Match Functions, available only for sizes lower than or equal to 32.

unsigned int match_all(T val, int &pred) const: Refer to Warp Match Functions, available only for sizes lower than or equal to 32.

Legacy member functions (aliases):

unsigned long long size() const: Total number of threads in the group (alias of num_threads())

Notes:

  • The thread_block_tile templated data structure is used here; the size of the group is passed to the tiled_partition call as a template parameter rather than an argument.

  • shfl, shfl_up, shfl_down, and shfl_xor functions accept objects of any type when compiled with C++11 or later. This means it’s possible to shuffle non-integral types as long as they satisfy the following constraints:

    • Qualifies as trivially copyable, i.e., is_trivially_copyable<T>::value == true

    • sizeof(T) <= 32 for tile sizes lower than or equal to 32, sizeof(T) <= 8 for larger tiles

  • On hardware with Compute Capability 7.5 or lower, tiles of size larger than 32 need a small amount of memory reserved for them. This can be done using the cooperative_groups::block_tile_memory struct template, which has to reside in either shared or global memory.

    template <unsigned int MaxBlockSize = 1024>
    struct block_tile_memory;
    

    MaxBlockSize specifies the maximum number of threads in the current thread block. This parameter can be used to minimize the shared memory usage of block_tile_memory in kernels launched only with smaller thread counts.

    This block_tile_memory then needs to be passed into cooperative_groups::this_thread_block, allowing the resulting thread_block to be partitioned into tiles of sizes larger than 32. The overload of this_thread_block accepting a block_tile_memory argument is a collective operation and has to be called with all threads in the thread_block.

    block_tile_memory can be used on hardware with Compute Capability 8.0 or higher in order to be able to write one source targeting multiple different Compute Capabilities. It should consume no memory when instantiated in shared memory in cases where it is not required.

Examples:

/// The following code will create two sets of tiled groups, of size 32 and 4 respectively:
/// The latter has the provenance encoded in the type, while the first stores it in the handle
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);
thread_block_tile<4, thread_block> tile4 = tiled_partition<4>(block);
/// The following code will create tiles of size 128 on all Compute Capabilities.
/// block_tile_memory can be omitted on Compute Capability 8.0 or higher.
__global__ void kernel(...) {
    // reserve shared memory for thread_block_tile usage,
    //   specify that block size will be at most 256 threads.
    __shared__ block_tile_memory<256> shared;
    thread_block thb = this_thread_block(shared);

    // Create tiles with 128 threads.
    auto tile = tiled_partition<128>(thb);

    // ...
}
8.4.2.1.1. Warp-Synchronous Code Pattern

Developers may have warp-synchronous code in which they previously made implicit assumptions about the warp size and coded around that number. Now this needs to be specified explicitly.

__global__ void cooperative_kernel(...) {
    // obtain default "current thread block" group
    thread_block my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    auto my_tile = tiled_partition<32>(my_block);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_tile.meta_group_rank() == 0) {
        // ...
        my_tile.sync();
    }
}
8.4.2.1.2. Single thread group

A group representing the current thread can be obtained from the this_thread function:

thread_block_tile<1> this_thread();

The following memcpy_async API uses a thread_group to copy an int element from source to destination:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

cooperative_groups::memcpy_async(cooperative_groups::this_thread(), dest, src, sizeof(int));

More detailed examples of using this_thread to perform asynchronous copies can be found in the Single-Stage Asynchronous Data Copies using cuda::pipeline and Multi-Stage Asynchronous Data Copies using cuda::pipeline sections.

8.4.2.2. Coalesced Groups

In CUDA’s SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If there exists a data-dependent conditional branch in the application code such that threads within a warp diverge, then the warp serially executes each branch, disabling threads not on that path. The threads that remain active on the path are referred to as coalesced. Cooperative Groups has functionality to discover, and create, a group containing all coalesced threads.

Constructing the group handle via coalesced_threads() is opportunistic. It returns the set of active threads at that point in time, and makes no guarantee about which threads are returned (as long as they are active) or that they will stay coalesced throughout execution (they will be brought back together for the execution of a collective but can diverge again afterwards).
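A common use is warp-aggregated atomics: one active thread performs the atomic on behalf of the whole coalesced group. A sketch (the helper name is illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: the group leader (rank 0 of the coalesced set) performs a single
// atomicAdd for all currently active threads; the others compute their own
// slot from the leader's result via shfl().
__device__ int atomic_agg_inc(int *counter) {
    cg::coalesced_group active = cg::coalesced_threads();
    int prev = 0;
    if (active.thread_rank() == 0)
        prev = atomicAdd(counter, (int)active.num_threads());
    // Broadcast the leader's value, then offset by each thread's rank.
    prev = active.shfl(prev, 0) + active.thread_rank();
    return prev;
}
```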

class coalesced_group;

Constructed via: 通过以下方式构建:

coalesced_group active = coalesced_threads();

Public Member Functions: 公共成员函数:

void sync() const: Synchronize the threads named in the group
void sync() const :同步组中命名的线程

unsigned long long num_threads() const: Total number of threads in the group
unsigned long long num_threads() const :组中的线程总数

unsigned long long thread_rank() const: Rank of the calling thread within [0, num_threads)
unsigned long long thread_rank() const :调用线程在[0,num_threads)范围内的排名

unsigned long long meta_group_size() const: Returns the number of groups created when the parent group was partitioned. If this group was created by querying the set of active threads, e.g. coalesced_threads() the value of meta_group_size() will be 1.
unsigned long long meta_group_size() const :返回父组分区时创建的组数。如果此组是通过查询活动线程集创建的,例如 coalesced_threads() ,则 meta_group_size() 的值将为 1。

unsigned long long meta_group_rank() const: Linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size). If this group was created by querying the set of active threads, e.g. coalesced_threads() the value of meta_group_rank() will always be 0.
unsigned long long meta_group_rank() const :在从父组分区的瓦片集合中的线性等级(受 meta_group_size 限制)。如果此组是通过查询活动线程集合创建的,例如 coalesced_threads() ,则 meta_group_rank() 的值将始终为 0。

T shfl(T var, unsigned int src_rank) const: Refer to Warp Shuffle Functions
T shfl(T var, unsigned int src_rank) const :参考 Warp Shuffle 函数

T shfl_up(T var, int delta) const: Refer to Warp Shuffle Functions
T shfl_up(T var, int delta) const :参考 Warp Shuffle 函数

T shfl_down(T var, int delta) const: Refer to Warp Shuffle Functions
T shfl_down(T var, int delta) const :参考 Warp Shuffle 函数

T any(int predicate) const: Refer to Warp Vote Functions
T any(int predicate) const :参考 Warp 投票功能

T all(int predicate) const: Refer to Warp Vote Functions
T all(int predicate) const :参考 Warp 投票功能

T ballot(int predicate) const: Refer to Warp Vote Functions
T ballot(int predicate) const :参考 Warp 投票功能

unsigned int match_any(T val) const: Refer to Warp Match Functions
unsigned int match_any(T val) const :参考 Warp 匹配函数

unsigned int match_all(T val, int &pred) const: Refer to Warp Match Functions
unsigned int match_all(T val, int &pred) const :参考 Warp 匹配函数

Legacy member functions (aliases):
传统成员函数(别名):

unsigned long long size() const: Total number of threads in the group (alias of num_threads())
unsigned long long size() const :组中的线程总数( num_threads() 的别名)

Notes: 注意事项:

shfl, shfl_up, and shfl_down functions accept objects of any type when compiled with C++11 or later. This means it’s possible to shuffle non-integral types as long as they satisfy the below constraints:
shfl, shfl_up, and shfl_down 函数在使用 C++11 或更高版本编译时接受任何类型的对象。这意味着可以对非整数类型进行洗牌,只要它们满足以下约束条件:

  • Qualifies as trivially copyable i.e. is_trivially_copyable<T>::value == true
    符合可以被平凡复制的条件,即 is_trivially_copyable<T>::value == true

  • sizeof(T) <= 32

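For instance, a small trivially copyable struct meeting both constraints above can be shuffled across a coalesced group (the struct and function names here are illustrative, not from the guide):

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Satisfies both constraints: trivially copyable and sizeof(Vec2) <= 32.
struct Vec2 { float x, y; };

__device__ Vec2 broadcast_from_rank0(cg::coalesced_group g, Vec2 v) {
    // Every active thread receives rank 0's copy of the struct.
    return g.shfl(v, 0);
}
```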
Example: 示例:

/// Consider a situation whereby there is a branch in the
/// code in which only the 2nd, 4th and 8th threads in each warp are
/// active. The coalesced_threads() call, placed in that branch, will create (for each
/// warp) a group, active, that has three threads (with
/// ranks 0-2 inclusive).
__global__ void kernel(int *globalInput) {
    // Let's say globalInput says that threads 2, 4, 8 should handle the data
    if (threadIdx.x == *globalInput) {
        coalesced_group active = coalesced_threads();
        // active contains 0-2 inclusive
        active.sync();
    }
}
8.4.2.2.1. Discovery Pattern
8.4.2.2.1. 发现模式 

Commonly developers need to work with the current active set of threads. No assumption is made about the threads that are present, and instead developers work with the threads that happen to be there. This is seen in the following “aggregating atomic increment across threads in a warp” example (written using the correct CUDA 9.0 set of intrinsics):
通常开发人员需要处理当前活动的线程集。不假设存在哪些线程,而是开发人员处理那些恰好存在的线程。这在以下“在 warp 中跨线程聚合原子增量”示例中得以体现(使用正确的 CUDA 9.0 内部函数集编写):

{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}

This can be re-written with Cooperative Groups as follows:
这可以使用协作组进行重写,如下所示:

{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.num_threads());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}

8.5. Group Partitioning
8.5. 组分区 

8.5.1. tiled_partition

template <unsigned int Size, typename ParentT>
thread_block_tile<Size, ParentT> tiled_partition(const ParentT& g);
thread_group tiled_partition(const thread_group& parent, unsigned int tilesz);

The tiled_partition method is a collective operation that partitions the parent group into a one-dimensional, row-major tiling of subgroups. A total of (size(parent)/tilesz) subgroups will be created, so the parent group size must be evenly divisible by the tile size. The allowed parent groups are thread_block or thread_block_tile.
tiled_partition 方法是一种集体操作,将父组分割为一维、按行主序排列的子组划分。将创建总共 (size(parent)/tilesz) 个子组,因此父组大小必须能够被瓦片大小均匀整除。允许的父组为 thread_blockthread_block_tile

The implementation may cause the calling thread to wait until all the members of the parent group have invoked the operation before resuming execution. Functionality is limited to native hardware sizes, 1/2/4/8/16/32 and the cg::size(parent) must be greater than the Size parameter. The templated version of tiled_partition supports 64/128/256/512 sizes as well, but some additional steps are required on Compute Capability 7.5 or lower, refer to Thread Block Tile for details.
实现可能会导致调用线程等待,直到父组的所有成员都调用该操作后才恢复执行。功能仅限于本机硬件大小 1/2/4/8/16/32,且 cg::size(parent) 必须大于 Size 参数。 tiled_partition 的模板化版本还支持 64/128/256/512 大小,但在 Compute Capability 7.5 或更低版本上需要一些额外步骤,请参考线程块瓦片以获取详细信息。

Codegen Requirements: Compute Capability 5.0 minimum, C++11 for sizes larger than 32
代码生成要求:计算能力不低于 5.0,对于大于 32 的尺寸需要 C++11

Example: 示例:

/// The following code will create a 32-thread tile
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);

We can partition each of these groups into even smaller groups, each of size 4 threads:
我们可以将这些组中的每一个分成更小的组,每个组包含 4 个线程:

auto tile4 = tiled_partition<4>(tile32);
// or using a general group
// thread_group tile4 = tiled_partition(tile32, 4);

If, for instance, we were to then include the following line of code:
例如,如果我们随后包含以下代码行:

if (tile4.thread_rank()==0) printf("Hello from tile4 rank 0\n");

then the statement would be printed by every fourth thread in the block: the threads of rank 0 in each tile4 group, which correspond to those threads with ranks 0,4,8,12,etc. in the block group.
那么该语句将由块中每第四个线程打印:每个 tile4 组中排名为 0 的线程,对应于 block 组中排名为 0,4,8,12 等的线程。
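The meta-group queries can be observed directly. The following sketch (kernel name illustrative, not from the guide) prints, for each tile4 group, its rank among the eight tiles carved out of each 32-thread tile:

```cuda
#include <cstdio>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void meta_group_demo() {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(block);
    auto tile4 = cg::tiled_partition<4>(tile32);

    // tile32 was split into 8 tiles of 4 threads each, so
    // meta_group_size() == 8 and meta_group_rank() is in [0, 8).
    if (tile4.thread_rank() == 0) {
        printf("tile %u of %u (block thread %u)\n",
               (unsigned)tile4.meta_group_rank(),
               (unsigned)tile4.meta_group_size(),
               (unsigned)block.thread_rank());
    }
}
```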

8.5.2. labeled_partition

template <typename Label>
coalesced_group labeled_partition(const coalesced_group& g, Label label);
template <unsigned int Size, typename Label>
coalesced_group labeled_partition(const thread_block_tile<Size>& g, Label label);

The labeled_partition method is a collective operation that partitions the parent group into one-dimensional subgroups within which the threads are coalesced. The implementation will evaluate a condition label and assign threads that have the same value for label into the same group.
labeled_partition 方法是一种集体操作,将父组分成一维子组,在这些子组中,线程被合并。实现将评估条件标签,并将具有相同标签值的线程分配到同一组中。

Label can be any integral type.
Label 可以是任何整数类型。

The implementation may cause the calling thread to wait until all the members of the parent group have invoked the operation before resuming execution.
实现可能会导致调用线程等待,直到父组的所有成员在恢复执行之前调用了操作。

Note: This functionality is still being evaluated and may slightly change in the future.
注意:此功能仍在评估中,可能会在未来略有更改。

Codegen Requirements: Compute Capability 7.0 minimum, C++11
代码生成要求:计算能力不低于 7.0,C++11
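A hedged sketch of labeled_partition in use (the kernel, array name, and choice of label are illustrative assumptions): threads computing the same label value end up together in one coalesced subgroup.

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void label_demo(const int* keys) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(block);

    // Label is any integral type; here threads are bucketed by key modulo 4.
    int label = keys[block.thread_rank()] % 4;

    // Threads of tile32 with equal label values land in the same subgroup.
    cg::coalesced_group subgroup = cg::labeled_partition(tile32, label);

    // Rank 0 of each subgroup could now act once per distinct label,
    // e.g. perform a single atomic on behalf of its subgroup.
    (void)subgroup;
}
```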

8.5.3. binary_partition

coalesced_group binary_partition(const coalesced_group& g, bool pred);
template <unsigned int Size>
coalesced_group binary_partition(const thread_block_tile<Size>& g, bool pred);

The binary_partition() method is a collective operation that partitions the parent group into one-dimensional subgroups within which the threads are coalesced. The implementation will evaluate a predicate and assign threads that have the same value into the same group. This is a specialized form of labeled_partition(), where the label can only be 0 or 1.
binary_partition() 方法是一种集体操作,将父组分成一维子组,在其中线程被合并。实现将评估一个谓词,并将具有相同值的线程分配到同一组中。这是 labeled_partition() 的一种专门形式,其中标签只能是 0 或 1。

The implementation may cause the calling thread to wait until all the members of the parent group have invoked the operation before resuming execution.
实现可能会导致调用线程等待,直到父组的所有成员在恢复执行之前调用了操作。

Note: This functionality is still being evaluated and may slightly change in the future.
注意:此功能仍在评估中,可能会在未来略有更改。

Codegen Requirements: Compute Capability 7.0 minimum, C++11
代码生成要求:计算能力不低于 7.0,C++11

Example: 示例:

/// This example divides a 32-sized tile into a group with odd
/// numbers and a group with even numbers
__global__ void oddEven(int *inputArr) {
    auto block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);

    // inputArr contains random integers
    int elem = inputArr[block.thread_rank()];
    // after this, tile32 is split into 2 groups,
    // a subtile where elem&1 is true and one where it's false
    auto subtile = cg::binary_partition(tile32, (elem & 1));
}

8.6. Group Collectives
8.6. 组集合 

Cooperative Groups library provides a set of collective operations that can be performed by a group of threads. These operations require participation of all threads in the specified group in order to complete the operation. All threads in the group need to pass the same values for corresponding arguments to each collective call, unless different values are explicitly allowed in the argument description. Otherwise the behavior of the call is undefined.
合作组库提供了一组可以由一组线程执行的集体操作。这些操作需要指定组中所有线程的参与才能完成操作。组中的所有线程需要为每个集体调用的相应参数传递相同的值,除非参数描述中明确允许使用不同的值。否则,调用的行为是未定义的。

8.6.1. Synchronization
8.6.1. 同步 

8.6.1.1. barrier_arrive and barrier_wait
8.6.1.1. barrier_arrivebarrier_wait

T::arrival_token T::barrier_arrive();
void T::barrier_wait(T::arrival_token&&);

barrier_arrive and barrier_wait member functions provide a synchronization API similar to cuda::barrier (read more). Cooperative Groups automatically initializes the group barrier, but the arrive and wait operations have an additional restriction resulting from the collective nature of those operations: all threads in the group must arrive and wait at the barrier once per phase. When barrier_arrive is called with a group, the result of calling any collective operation or another barrier arrival with that group is undefined until completion of the barrier phase is observed with a barrier_wait call. Threads blocked on barrier_wait might be released from the synchronization before other threads call barrier_wait, but only after all threads in the group have called barrier_arrive. Group type T can be any of the implicit groups. This allows threads to do independent work after they arrive and before they wait for the synchronization to resolve, hiding some of the synchronization latency. barrier_arrive returns an arrival_token object that must be passed into the corresponding barrier_wait. The token is consumed this way and cannot be used for another barrier_wait call.
barrier_arrivebarrier_wait 成员函数提供了类似于 cuda::barrier 的同步 API(阅读更多)。协作组自动初始化组屏障,但到达和等待操作由于这些操作的集体性质而有额外的限制:组中的所有线程必须在每个阶段一次到达并等待在屏障处。当使用组调用 barrier_arrive 时,使用该组调用任何集体操作或另一个屏障到达的结果在观察到屏障阶段的完成时是未定义的,直到使用 barrier_wait 调用。在 barrier_wait 上阻塞的线程可能会在其他线程调用 barrier_wait 之前释放同步,但只有在组中的所有线程都调用 barrier_arrive 之后才会释放。组类型 T 可以是任何隐式组。这允许线程在到达后和等待同步解决之前做独立的工作,从而隐藏一些同步延迟。 barrier_arrive 返回一个 arrival_token 对象,必须传递给相应的 barrier_wait 。令牌以这种方式被消耗,不能用于另一个 barrier_wait 调用。

Example of barrier_arrive and barrier_wait used to synchronize initialization of shared memory across the cluster:
barrier_arrive 和 barrier_wait 的示例,用于在整个集群中同步共享内存的初始化:

#include <cooperative_groups.h>

using namespace cooperative_groups;

void __device__ init_shared_data(const thread_block& block, int *data);
void __device__ local_processing(const thread_block& block);
void __device__ process_shared_data(const thread_block& block, int *data);

__global__ void cluster_kernel() {
    extern __shared__ int array[];
    auto cluster = this_cluster();
    auto block   = this_thread_block();

    // Use this thread block to initialize some shared state
    init_shared_data(block, &array[0]);

    auto token = cluster.barrier_arrive(); // Let other blocks know this block is running and data was initialized

    // Do some local processing to hide the synchronization latency
    local_processing(block);

    // Map data in shared memory from the next block in the cluster
    int *dsmem = cluster.map_shared_rank(&array[0], (cluster.block_rank() + 1) % cluster.num_blocks());

    // Make sure all other blocks in the cluster are running and initialized shared data before accessing dsmem
    cluster.barrier_wait(std::move(token));

    // Consume data in distributed shared memory
    process_shared_data(block, dsmem);
    cluster.sync();
}

8.6.1.2. sync

static void T::sync();

template <typename T>
void sync(T& group);

sync synchronizes the threads named in the group. Group type T can be any of the existing group types, as all of them support synchronization. It is available as a member function in every group type or as a free function taking a group as parameter. If the group is a grid_group or a multi_grid_group the kernel must have been launched using the appropriate cooperative launch APIs. Equivalent to T.barrier_wait(T.barrier_arrive()).
sync 同步组中命名的线程。组类型 T 可以是任何现有组类型,因为它们都支持同步。它作为每种组类型的成员函数可用,也可以作为一个以组为参数的自由函数。如果组是 grid_groupmulti_grid_group ,则内核必须使用适当的协作启动 API 进行启动。等效于 T.barrier_wait(T.barrier_arrive())
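A short sketch of both forms (illustrative kernel, not from the guide); the member-function and free-function calls below are interchangeable:

```cuda
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

__global__ void sync_forms(const int* data) {
    cg::thread_block block = cg::this_thread_block();
    __shared__ int smem[256];  // assumes blockDim.x <= 256

    smem[block.thread_rank()] = data[block.thread_rank()];

    block.sync();     // member-function form
    // ... safely read smem entries written by other threads ...
    cg::sync(block);  // equivalent free-function form
}
```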

8.6.2. Data Transfer
8.6.2. 数据传输 

8.6.2.1. memcpy_async

memcpy_async is a group-wide collective memcpy that utilizes hardware-accelerated support for non-blocking memory transactions from global to shared memory. Given a set of threads named in the group, memcpy_async will move a specified amount of bytes or elements of the input type through a single pipeline stage. Additionally, to achieve the best performance when using the memcpy_async API, an alignment of 16 bytes for both shared memory and global memory is required. It is important to note that while this is a memcpy in the general case, it is only asynchronous if the source is global memory and the destination is shared memory and both can be addressed with 16, 8, or 4 byte alignments. Asynchronously copied data should only be read following a call to wait or wait_prior, which signals that the corresponding stage has completed moving data to shared memory.
memcpy_async 是一个全组范围的集体 memcpy,利用硬件加速支持从全局到共享内存的非阻塞内存事务。给定一组命名为组的线程, memcpy_async 将通过单个管道阶段移动指定数量的字节或输入类型的元素。此外,在使用 memcpy_async API 时实现最佳性能时,共享内存和全局内存都需要 16 字节的对齐。需要注意的是,虽然这是一般情况下的 memcpy,但只有在源是全局内存且目的地是共享内存且两者都可以用 16、8 或 4 字节对齐时才是异步的。异步复制的数据应该在调用 wait 或 wait_prior 后才能读取,这会表示相应的阶段已完成将数据移动到共享内存。

Having to wait on all outstanding requests can lose some flexibility (but gain simplicity). In order to efficiently overlap data transfer and execution, it is important to be able to kick off an N+1 memcpy_async request while waiting on and operating on request N. To do so, use memcpy_async and wait on it using the collective stage-based wait_prior API. See wait and wait_prior for more details.
必须等待所有未完成的请求可能会失去一些灵活性(但获得简单性)。为了有效地重叠数据传输和执行,重要的是能够在等待和处理请求 N 的同时启动 N+1 memcpy_async 请求。为此,请使用 memcpy_async 并使用基于阶段的 wait_prior API 进行等待。有关更多详细信息,请参见 wait 和 wait_prior。

Usage 1 用法 1

template <typename TyGroup, typename TyElem, typename TyShape>
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ _dst,
  const TyElem *__restrict__ _src,
  const TyShape &shape
);

Performs a copy of shape bytes.
执行对 shape 字节的复制。

Usage 2 用法 2

template <typename TyGroup, typename TyElem, typename TyDstLayout, typename TySrcLayout>
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ dst,
  const TyDstLayout &dstLayout,
  const TyElem *__restrict__ src,
  const TySrcLayout &srcLayout
);

Performs a copy of min(dstLayout, srcLayout) elements. If layouts are of type cuda::aligned_size_t<N>, both must specify the same alignment.
执行 min(dstLayout, srcLayout) 个元素的复制。如果布局是 cuda::aligned_size_t<N> 类型,则两者必须指定相同的对齐方式。

Errata: The memcpy_async API introduced in CUDA 11.1, with both src and dst input layouts, expects the layout to be provided in elements rather than bytes. The element type is inferred from TyElem and has the size sizeof(TyElem). If the cuda::aligned_size_t<N> type is used as the layout, the number of elements specified times sizeof(TyElem) must be a multiple of N, and it is recommended to use std::byte or char as the element type.
勘误 CUDA 11.1 中引入的 memcpy_async API,具有 src 和 dst 输入布局,期望以元素而不是字节的形式提供布局。元素类型从 TyElem 推断,并具有大小 sizeof(TyElem) 。如果使用 cuda::aligned_size_t<N> 类型作为布局,则指定的元素数量乘以 sizeof(TyElem) 必须是 N 的倍数,并建议使用 std::bytechar 作为元素类型。

If specified shape or layout of the copy is of type cuda::aligned_size_t<N>, alignment will be guaranteed to be at least min(16, N). In that case both dst and src pointers need to be aligned to N bytes and the number of bytes copied needs to be a multiple of N.
如果指定的形状或布局的副本类型为 cuda::aligned_size_t<N> ,则保证对齐至少为 min(16, N) 。在这种情况下, dstsrc 指针都需要对齐到 N 字节,并且复制的字节数需要是 N 的倍数。

Codegen Requirements: Compute Capability 5.0 minimum, Compute Capability 8.0 for asynchronicity, C++11
代码生成要求:最低计算能力为 5.0,用于异步性的计算能力为 8.0,C++11

cooperative_groups/memcpy_async.h header needs to be included.
需要包含 cooperative_groups/memcpy_async.h 标头。

Example: 示例:

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024;
    const size_t elementsInShared = 128;
    __shared__ int local_smem[elementsInShared];

    size_t copy_count;
    size_t index = 0;
    while (index < elementsPerThreadBlock) {
        cg::memcpy_async(tb, local_smem, elementsInShared, global_data + index, elementsPerThreadBlock - index);
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        cg::wait(tb);
        // Work with local_smem
        index += copy_count;
    }
}

8.6.2.2. wait and wait_prior

template <typename TyGroup>
void wait(TyGroup & group);

template <unsigned int NumStages, typename TyGroup>
void wait_prior(TyGroup & group);

The wait and wait_prior collectives allow threads to wait for memcpy_async copies to complete. wait blocks the calling threads until all previous copies are done. wait_prior allows the latest NumStages copies to still be in flight and waits for all the previous requests: with N total copies requested, it waits until the first N-NumStages are done, while the last NumStages might still be in progress. Both wait and wait_prior will synchronize the named group.
waitwait_prior 集合允许等待 memcpy_async 复制完成。 wait 阻止调用线程,直到所有先前的复制完成。 wait_prior 允许最新的 NumStages 仍未完成,并等待所有先前的请求。因此,总共请求 N 次复制,它会等待直到前 N-NumStages 个完成,最后一个 NumStages 可能仍在进行中。 waitwait_prior 都会同步命名组。

Codegen Requirements: Compute Capability 5.0 minimum, Compute Capability 8.0 for asynchronicity, C++11
代码生成要求:最低计算能力为 5.0,用于异步性的计算能力为 8.0,C++11

cooperative_groups/memcpy_async.h header needs to be included.
需要包含 cooperative_groups/memcpy_async.h 标头。

Example: 示例:

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on in
/// multiple (two) stages. As stage N is kicked off, we can wait on and operate on stage N-1.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024 + 64;
    const size_t elementsInShared = 128;
    __align__(16) __shared__ int local_smem[2][elementsInShared];
    int stage = 0;
    // First kick off an extra request
    size_t copy_count = elementsInShared;
    size_t index = copy_count;
    cg::memcpy_async(tb, local_smem[stage], elementsInShared, global_data, elementsPerThreadBlock - index);
    while (index < elementsPerThreadBlock) {
        // Now we kick off the next request...
        cg::memcpy_async(tb, local_smem[stage ^ 1], elementsInShared, global_data + index, elementsPerThreadBlock - index);
        // ... but we wait on the one before it
        cg::wait_prior<1>(tb);

        // It's now available and we can work with local_smem[stage] here
        // (...)
        //

        // Calculate the amount of data that was actually copied, for the next iteration.
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        index += copy_count;

        // A cg::sync(tb) might be needed here depending on whether
        // the work done with local_smem[stage] can release threads to race ahead or not
        // Wrap to the next stage
        stage ^= 1;
    }
    cg::wait(tb);
    // The last local_smem[stage] can be handled here
}

8.6.3. Data Manipulation
8.6.3. 数据操作 

8.6.3.1. reduce

template <typename TyGroup, typename TyArg, typename TyOp>
auto reduce(const TyGroup& group, TyArg&& val, TyOp&& op) -> decltype(op(val, val));

reduce performs a reduction operation on the data provided by each thread named in the group passed in. This takes advantage of hardware acceleration (on compute 80 and higher devices) for the arithmetic add, min, or max operations and the logical AND, OR, or XOR, as well as providing a software fallback on older generation hardware. Only 4B types are accelerated by hardware.
reduce 对传入的组中每个线程提供的数据执行减少操作。这利用了硬件加速(在计算 80 及更高设备上)进行算术加法、最小值或最大值操作以及逻辑 AND、OR 或 XOR,并在较旧的硬件上提供软件回退。只有 4B 类型受到硬件加速。

group: Valid group types are coalesced_group and thread_block_tile.
group :有效的组类型为 coalesced_groupthread_block_tile

val: Any type that satisfies the below requirements:
val :满足以下要求的任何类型:

  • Qualifies as trivially copyable i.e. is_trivially_copyable<TyArg>::value == true
    符合可以被平凡复制的条件,即 is_trivially_copyable<TyArg>::value == true

  • sizeof(T) <= 32 for coalesced_group and tiles of size lower or equal 32, sizeof(T) <= 8 for larger tiles
    sizeof(T) <= 32 适用于 coalesced_group 和尺寸小于或等于 32 的瓦片, sizeof(T) <= 8 适用于较大的瓦片

  • Has suitable arithmetic or comparative operators for the given function object.
    具有适当的算术或比较运算符,适用于给定的函数对象。

Note: Different threads in the group can pass different values for this argument.
注意:组中的不同线程可以传递不同的值给这个参数。

op: Valid function objects that will provide hardware acceleration with integral types are plus(), less(), greater(), bit_and(), bit_xor(), bit_or(). These must be constructed, hence the TyVal template argument is required, i.e. plus<int>(). Reduce also supports lambdas and other function objects that can be invoked using operator()
op :可以提供整数类型硬件加速的有效函数对象为 plus(), less(), greater(), bit_and(), bit_xor(), bit_or() 。这些必须被构造,因此需要 TyVal 模板参数,即 plus<int>() 。Reduce 还支持可以使用 operator() 调用的 lambda 和其他函数对象。

Asynchronous reduce 异步减少

template <typename TyGroup, typename TyArg, typename TyAtomic, typename TyOp>
void reduce_update_async(const TyGroup& group, TyAtomic& atomic, TyArg&& val, TyOp&& op);

template <typename TyGroup, typename TyArg, typename TyAtomic, typename TyOp>
void reduce_store_async(const TyGroup& group, TyAtomic& atomic, TyArg&& val, TyOp&& op);

template <typename TyGroup, typename TyArg, typename TyOp>
void reduce_store_async(const TyGroup& group, TyArg* ptr, TyArg&& val, TyOp&& op);

The *_async variants of the API asynchronously calculate the result and either store it to or update a specified destination via one of the participating threads, instead of returning it to each thread. To observe the effect of these asynchronous calls, the calling group of threads, or a larger group containing them, needs to be synchronized.
API 的 *_async 个变体正在异步计算结果,以便由参与线程之一存储或更新到指定目的地,而不是由每个线程返回。为了观察这些异步调用的效果,需要同步调用线程组或包含它们的较大组。

  • In case of the atomic store or update variant, atomic argument can be either of cuda::atomic or cuda::atomic_ref available in CUDA C++ Standard Library. This variant of the API is available only on platforms and devices, where these types are supported by the CUDA C++ Standard Library. Result of the reduction is used to atomically update the atomic according to the specified op, eg. the result is atomically added to the atomic in case of cg::plus(). Type held by the atomic must match the type of TyArg. Scope of the atomic must include all the threads in the group and if multiple groups are using the same atomic concurrently, scope must include all threads in all groups using it. Atomic update is performed with relaxed memory ordering.
    在原子存储或更新变体的情况下, atomic 参数可以是 CUDA C++ 标准库中可用的 cuda::atomiccuda::atomic_ref 之一。此 API 的这种变体仅在支持这些类型的平台和设备上才可用 CUDA C++ 标准库。归约的结果用于根据指定的 op 原子更新原子,例如,在 cg::plus() 的情况下,结果会被原子地添加到原子中。 atomic 持有的类型必须与 TyArg 的类型匹配。原子的范围必须包括组中的所有线程,如果多个组同时使用同一个原子,则范围必须包括所有使用它的所有组中的线程。使用松散的内存顺序执行原子更新。

  • In case of the pointer store variant, result of the reduction will be weakly stored into the dst pointer.
    在指针存储变体的情况下,减少的结果将被弱存储到 dst 指针中。

Codegen Requirements: Compute Capability 5.0 minimum, Compute Capability 8.0 for HW acceleration, C++11.
代码生成要求:最低计算能力为 5.0,硬件加速需要计算能力为 8.0,使用 C++11。

cooperative_groups/reduce.h header needs to be included.
需要包含 cooperative_groups/reduce.h 标头。

Example of approximate standard deviation for integer vector:
整数向量的近似标准差示例:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

/// Calculate approximate standard deviation of integers in vec
__device__ int std_dev(const cg::thread_block_tile<32>& tile, int *vec, int length) {
    int thread_sum = 0;

    // calculate average first
    for (int i = tile.thread_rank(); i < length; i += tile.num_threads()) {
        thread_sum += vec[i];
    }
    // cg::plus<int> allows cg::reduce() to know it can use hardware acceleration for addition
    int avg = cg::reduce(tile, thread_sum, cg::plus<int>()) / length;

    int thread_diffs_sum = 0;
    for (int i = tile.thread_rank(); i < length; i += tile.num_threads()) {
        int diff = vec[i] - avg;
        thread_diffs_sum += diff * diff;
    }

    // temporarily use floats to calculate the square root
    float diff_sum = static_cast<float>(cg::reduce(tile, thread_diffs_sum, cg::plus<int>())) / length;

    return static_cast<int>(sqrtf(diff_sum));
}

Example of block wide reduction:
块宽减少示例:

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg=cooperative_groups;

/// The following example accepts input in *A and outputs a result into *sum
/// It spreads the data equally within the block
__device__ void block_reduce(const int* A, int count, cuda::atomic<int, cuda::thread_scope_block>& total_sum) {
    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<32>(block);
    int thread_sum = 0;

    // Stride loop over all values, each thread accumulates its part of the array.
    for (int i = block.thread_rank(); i < count; i += block.size()) {
        thread_sum += A[i];
    }

    // reduce thread sums across the tile, add the result to the atomic
    // cg::plus<int> allows cg::reduce() to know it can use hardware acceleration for addition
    cg::reduce_update_async(tile, total_sum, thread_sum, cg::plus<int>());

    // synchronize the block, to ensure all async reductions are ready
    block.sync();
}

8.6.3.2. Reduce Operators

Below are the prototypes of function objects for some of the basic operations that can be done with reduce.
以下是一些可以使用 reduce 执行的基本操作的函数对象原型。

namespace cooperative_groups {
  template <typename Ty>
  struct cg::plus;

  template <typename Ty>
  struct cg::less;

  template <typename Ty>
  struct cg::greater;

  template <typename Ty>
  struct cg::bit_and;

  template <typename Ty>
  struct cg::bit_xor;

  template <typename Ty>
  struct cg::bit_or;
}

Reduce is limited to the information available to the implementation at compile time. Thus in order to make use of intrinsics introduced in CC 8.0, the cg:: namespace exposes several functional objects that mirror the hardware. These objects appear similar to those presented in the C++ STL, with the exception of less/greater. The reason for any difference from the STL is that these function objects are designed to actually mirror the operation of the hardware intrinsics.
Reduce 仅限于编译时实现可用的信息。因此,为了利用 CC 8.0 中引入的内部函数, cg:: 命名空间公开了几个功能对象,这些对象反映了硬件的特性。这些对象看起来类似于 C++ STL 中提供的对象,除了 less/greater 。与 STL 有任何不同之处的原因是,这些函数对象旨在实际反映硬件内部函数的操作。

Functional description: 功能描述:

  • cg::plus: Accepts two values and returns the sum of both using operator+.
    cg::plus: 接受两个值,并使用 operator+返回两者的和。

  • cg::less: Accepts two values and returns the lesser using operator<. This differs in that the lower value is returned rather than a Boolean.
    cg::less: 接受两个值,并使用 operator<返回较小的值。与返回布尔值不同,这里返回较小的值。

  • cg::greater: Accepts two values and returns the greater using operator<. This differs in that the greater value is returned rather than a Boolean.
    cg::greater: 接受两个值,并使用 operator<返回较大的值。不同之处在于返回较大的值,而不是布尔值。

  • cg::bit_and: Accepts two values and returns the result of operator&.
    cg::bit_and: 接受两个值并返回 operator&的结果。

  • cg::bit_xor: Accepts two values and returns the result of operator^.
    cg::bit_xor: 接受两个值并返回 operator^的结果。

  • cg::bit_or: Accepts two values and returns the result of operator|.
    cg::bit_or: 接受两个值并返回运算符|的结果。

Example: 示例:

{
    // cg::plus<int> is specialized within cg::reduce and calls __reduce_add_sync(...) on CC 8.0+
    cg::reduce(tile, (int)val, cg::plus<int>());

    // cg::plus<float> fails to match with an accelerator and instead performs a standard shuffle based reduction
    cg::reduce(tile, (float)val, cg::plus<float>());

    // While individual components of a vector are supported, reduce will not use hardware intrinsics for the following
    // It will also be necessary to define a corresponding operator for vector and any custom types that may be used
    int4 vec = {...};
    cg::reduce(tile, vec, cg::plus<int4>());

    // Finally lambdas and other function objects cannot be inspected for dispatch
    // and will instead perform shuffle based reductions using the provided function object.
    cg::reduce(tile, (int)val, [](int l, int r) -> int {return l + r;});
}

8.6.3.3. inclusive_scan and exclusive_scan
8.6.3.3. inclusive_scanexclusive_scan

template <typename TyGroup, typename TyVal, typename TyFn>
auto inclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyVal>
TyVal inclusive_scan(const TyGroup& group, TyVal&& val);

template <typename TyGroup, typename TyVal, typename TyFn>
auto exclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyVal>
TyVal exclusive_scan(const TyGroup& group, TyVal&& val);

inclusive_scan and exclusive_scan perform a scan operation on the data provided by each thread named in the group passed in. In the case of exclusive_scan, the result for each thread is a reduction of the data from threads with a lower thread_rank than that thread. The inclusive_scan result also includes the calling thread's data in the reduction.
inclusive_scanexclusive_scan 对传入的组中每个线程提供的数据执行扫描操作。对于 exclusive_scan ,每个线程的结果是对 thread_rank 低于该线程的线程数据进行的归约。 inclusive_scan 的结果还在归约中包括调用线程自身的数据。

group: Valid group types are coalesced_group and thread_block_tile.

val: Any type that satisfies the below requirements:

  • Qualifies as trivially copyable i.e. is_trivially_copyable<TyArg>::value == true

  • sizeof(T) <= 32 for coalesced_group and tiles of size lower or equal 32, sizeof(T) <= 8 for larger tiles

  • Has suitable arithmetic or comparative operators for the given function object.

Note: Different threads in the group can pass different values for this argument.

op: The function objects defined for convenience are plus(), less(), greater(), bit_and(), bit_xor(), and bit_or(), described in Reduce Operators. These must be constructed, hence the TyVal template argument is required, for example plus<int>(). inclusive_scan and exclusive_scan also support lambdas and other function objects that can be invoked using operator(). Overloads without this argument use cg::plus<TyVal>().
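As an illustration of the semantics only (not the CUDA implementation), the two variants can be sketched as a sequential host-side C++ analogue, where element i of a vector stands in for the val contributed by the thread with thread_rank() equal to i, and op plays the role of the function object:

```cpp
#include <cassert>
#include <vector>

// Sequential analogue: vals[i] stands for the `val` of the thread with rank i.
template <typename T, typename Op>
std::vector<T> inclusive_scan_ref(const std::vector<T>& vals, Op op) {
    std::vector<T> out(vals.size());
    T acc = vals[0];            // rank 0 sees only its own value
    out[0] = acc;
    for (std::size_t i = 1; i < vals.size(); ++i) {
        acc = op(acc, vals[i]); // each rank also folds in its own value
        out[i] = acc;
    }
    return out;
}

template <typename T, typename Op>
std::vector<T> exclusive_scan_ref(const std::vector<T>& vals, Op op) {
    std::vector<T> out(vals.size());
    out[0] = T{};               // rank 0 has no lower-ranked data; T{} models the identity of plus
    T acc = vals[0];
    for (std::size_t i = 1; i < vals.size(); ++i) {
        out[i] = acc;           // reduction of ranks 0..i-1 only
        acc = op(acc, vals[i]);
    }
    return out;
}
```

For inputs {0, 1, 2, ..., 7} with a plus op, the inclusive variant reproduces the sequence 0, 1, 3, 6, 10, 15, 21, 28, while the exclusive variant yields 0, 0, 1, 3, 6, 10, 15, 21.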

Scan update

template <typename TyGroup, typename TyAtomic, typename TyVal, typename TyFn>
auto inclusive_scan_update(const TyGroup& group, TyAtomic& atomic, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyAtomic, typename TyVal>
TyVal inclusive_scan_update(const TyGroup& group, TyAtomic& atomic, TyVal&& val);

template <typename TyGroup, typename TyAtomic, typename TyVal, typename TyFn>
auto exclusive_scan_update(const TyGroup& group, TyAtomic& atomic, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyAtomic, typename TyVal>
TyVal exclusive_scan_update(const TyGroup& group, TyAtomic& atomic, TyVal&& val);

The *_scan_update collectives take an additional argument atomic, which can be either cuda::atomic or cuda::atomic_ref from the CUDA C++ Standard Library. These variants of the API are available only on platforms and devices where these types are supported by the CUDA C++ Standard Library. These variants update the atomic according to op with the sum of the input values of all threads in the group. The previous value of the atomic is combined with the result of the scan by each thread and returned. The type held by the atomic must match the type of TyVal. The scope of the atomic must include all threads in the group, and if multiple groups use the same atomic concurrently, the scope must include all threads in all groups using it. The atomic update is performed with relaxed memory ordering.

Following pseudocode illustrates how the update variant of scan works:

/*
 inclusive_scan_update behaves as the following block,
 except both reduce and inclusive_scan are calculated simultaneously.
auto total = reduce(group, val, op);
TyVal old;
if (group.thread_rank() == selected_thread) {
    atomically {
        old = atomic.load();
        atomic.store(op(old, total));
    }
}
old = group.shfl(old, selected_thread);
return op(inclusive_scan(group, val, op), old);
*/
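The update variants can likewise be sketched as a sequential host-side C++ analogue (an illustration of the semantics with cg::plus, not the CUDA implementation): the group's total is added to the counter exactly once, and every thread combines the counter's previous value with its own exclusive-scan result:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Sequential analogue of exclusive_scan_update with cg::plus:
// vals[i] stands for the `val` of the thread with rank i.
std::vector<int> exclusive_scan_update_ref(std::atomic<int>& counter,
                                           const std::vector<int>& vals) {
    int total = 0;
    for (int v : vals) total += v;      // reduce(group, val, plus)
    // A single relaxed update on behalf of the whole group.
    int old = counter.fetch_add(total, std::memory_order_relaxed);
    std::vector<int> out(vals.size());
    int running = 0;                    // exclusive scan over the group's values
    for (std::size_t i = 0; i < vals.size(); ++i) {
        out[i] = old + running;         // op(exclusive_scan(...), old)
        running += vals[i];
    }
    return out;
}
```

Starting from a counter holding 10, inputs {2, 1, 2} yield per-thread offsets {10, 12, 13} and leave the counter at 15, matching the buffer-allocation use case shown below.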

Codegen Requirements: Compute Capability 5.0 minimum, C++11.

The cooperative_groups/scan.h header needs to be included.

Example:

#include <stdio.h>
#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    auto thread_block = cg::this_thread_block();
    auto tile = cg::tiled_partition<8>(thread_block);
    unsigned int val = cg::inclusive_scan(tile, tile.thread_rank());
    printf("%u: %u\n", tile.thread_rank(), val);
}

/*  prints for each group:
    0: 0
    1: 1
    2: 3
    3: 6
    4: 10
    5: 15
    6: 21
    7: 28
*/

Example of stream compaction using exclusive_scan:

#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

// put data from input into output only if it passes test_fn predicate
template<typename Group, typename Data, typename TyFn>
__device__ int stream_compaction(Group &g, Data *input, int count, TyFn&& test_fn, Data *output) {
    int per_thread = count / g.num_threads();
    int thread_start = min(g.thread_rank() * per_thread, count);
    int my_count = min(per_thread, count - thread_start);

    // get all passing items from my part of the input
    //  into a contiguous part of the array and count them.
    int i = thread_start;
    while (i < my_count + thread_start) {
        if (test_fn(input[i])) {
            i++;
        }
        else {
            my_count--;
            input[i] = input[my_count + thread_start];
        }
    }

    // scan over counts from each thread to calculate my starting
    //  index in the output
    int my_idx = cg::exclusive_scan(g, my_count);

    for (i = 0; i < my_count; ++i) {
        output[my_idx + i] = input[thread_start + i];
    }
    // return the total number of items in the output
    return g.shfl(my_idx + my_count, g.num_threads() - 1);
}

Example of dynamic buffer space allocation using exclusive_scan_update:

#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

// Buffer partitioning is static to make the example easier to follow,
// but any arbitrary dynamic allocation scheme can be implemented by replacing this function.
__device__ int calculate_buffer_space_needed(cg::thread_block_tile<32>& tile) {
    return tile.thread_rank() % 2 + 1;
}

__device__ int my_thread_data(int i) {
    return i;
}

__global__ void kernel() {
    extern __shared__ int buffer[];
    __shared__ cuda::atomic<int, cuda::thread_scope_block> buffer_used;

    auto block = cg::this_thread_block();
    auto tile = cg::tiled_partition<32>(block);
    buffer_used = 0;
    block.sync();

    // each thread calculates buffer size it needs
    int buf_needed = calculate_buffer_space_needed(tile);

    // scan over the needs of each thread, result for each thread is an offset
    // of that thread’s part of the buffer. buffer_used is atomically updated with
    // the sum of all thread's inputs, to correctly offset other tile’s allocations
    int buf_offset =
        cg::exclusive_scan_update(tile, buffer_used, buf_needed);

    // each thread fills its own part of the buffer with thread specific data
    for (int i = 0 ; i < buf_needed ; ++i) {
        buffer[buf_offset + i] = my_thread_data(i);
    }

    block.sync();
    // buffer_used now holds total amount of memory allocated
    // buffer is {0, 0, 1, 0, 0, 1 ...};
}

8.6.4. Execution control

8.6.4.1. invoke_one and invoke_one_broadcast

template<typename Group, typename Fn, typename... Args>
void invoke_one(const Group& group, Fn&& fn, Args&&... args);

template<typename Group, typename Fn, typename... Args>
auto invoke_one_broadcast(const Group& group, Fn&& fn, Args&&... args) -> decltype(fn(args...));

invoke_one selects a single arbitrary thread from the calling group and uses that thread to call the supplied invocable fn with the supplied arguments args. In case of invoke_one_broadcast the result of the call is also distributed to all threads in the group and returned from this collective.

The calling group may be synchronized with the selected thread before and/or after it calls the supplied invocable. This means that communication within the calling group is not allowed inside the body of the supplied invocable, otherwise forward progress is not guaranteed. Communication with threads outside of the calling group is allowed in the body of the supplied invocable. The thread selection mechanism is not guaranteed to be deterministic.

On devices with Compute Capability 9.0 or higher, hardware acceleration may be used to select the thread when called with explicit group types.

group: All group types are valid for invoke_one, coalesced_group and thread_block_tile are valid for invoke_one_broadcast.

fn: Function or object that can be invoked using operator().

args: Parameter pack of types matching types of parameters of the supplied invocable fn.

In case of invoke_one_broadcast the return type of the supplied invocable fn must satisfy the below requirements:
invoke_one_broadcast 的情况下,所提供的可调用 fn 的返回类型必须满足以下要求:

  • Qualifies as trivially copyable i.e. is_trivially_copyable<T>::value == true

  • sizeof(T) <= 32 for coalesced_group and tiles of size lower or equal 32, sizeof(T) <= 8 for larger tiles

Codegen Requirements: Compute Capability 5.0 minimum, Compute Capability 9.0 for hardware acceleration, C++11.
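As a rough host-side C++ analogue of the semantics (an illustration only, not the CUDA implementation): one arbitrary member of the group invokes fn exactly once, and for invoke_one_broadcast the result is then made visible to every member:

```cpp
#include <vector>

// Sequential analogue of invoke_one_broadcast: the group is modeled as
// `group_size` result slots, one per thread.
template <typename Fn>
std::vector<int> invoke_one_broadcast_ref(int group_size, Fn&& fn) {
    // The selection is arbitrary and need not be deterministic;
    // this model simply lets one caller do the work.
    int result = fn();                            // fn runs exactly once
    return std::vector<int>(group_size, result);  // broadcast to all threads
}
```

Mirroring the aggregated-atomic pattern below, a lambda that performs a single fetch_add on behalf of the whole group is invoked once, and every thread observes the same previous value.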

The aggregated atomic example from the Discovery pattern section, rewritten to use invoke_one_broadcast:

#include <cooperative_groups.h>
#include <cuda/atomic>
namespace cg = cooperative_groups;

template<cuda::thread_scope Scope>
__device__ unsigned int atomicAddOneRelaxed(cuda::atomic<unsigned int, Scope>& atomic) {
    auto g = cg::coalesced_threads();
    auto prev = cg::invoke_one_broadcast(g, [&] () {
        return atomic.fetch_add(g.num_threads(), cuda::memory_order_relaxed);
    });
    return prev + g.thread_rank();
}

8.7. Grid Synchronization

Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. The kernel boundary carries with it an implicit invalidation of state, and with it, potential performance implications.

For example, in certain use cases, applications have a large number of small kernels, with each kernel representing a stage in a processing pipeline. The presence of these kernels is required by the current CUDA programming model to ensure that the thread blocks operating on one pipeline stage have produced data before the thread block operating on the next pipeline stage is ready to consume it. In such cases, the ability to provide global inter thread block synchronization would allow the application to be restructured to have persistent thread blocks, which are able to synchronize on the device when a given stage is complete.

To synchronize across the grid, from within a kernel, you would simply use the grid.sync() function:

grid_group grid = this_grid();
grid.sync();

And when launching the kernel it is necessary to use, instead of the <<<...>>> execution configuration syntax, the cudaLaunchCooperativeKernel CUDA runtime launch API or the CUDA driver equivalent.

Example:

To guarantee co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered. For example, as many blocks as there are SMs can be launched as follows:

int dev = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// initialize, then launch
cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);

Alternatively, you can maximize the exposed parallelism by calculating how many blocks can fit simultaneously per-SM using the occupancy calculator as follows:

/// This will launch a grid that can maximally fill the GPU, on the default stream with kernel arguments
int numBlocksPerSm = 0;
 // Number of threads my_kernel will be launched with
int numThreads = 128;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, my_kernel, numThreads, 0);
// launch
void *kernelArgs[] = { /* add kernel args */ };
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount*numBlocksPerSm, 1, 1);
cudaLaunchCooperativeKernel((void*)my_kernel, dimGrid, dimBlock, kernelArgs);

It is good practice to first ensure the device supports cooperative launches by querying the device attribute cudaDevAttrCooperativeLaunch:

int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);

which will set supportsCoopLaunch to 1 if the property is supported on device 0. Only devices with compute capability 6.0 and higher are supported. In addition, you need to be running on one of the following:

  • The Linux platform without MPS

  • The Linux platform with MPS and on a device with compute capability 7.0 or higher

  • The latest Windows platform

8.8. Multi-Device Synchronization

In order to enable synchronization across multiple devices with Cooperative Groups, use of the cudaLaunchCooperativeKernelMultiDevice CUDA API is required. This, a significant departure from existing CUDA APIs, will allow a single host thread to launch a kernel across multiple devices. In addition to the constraints and guarantees made by cudaLaunchCooperativeKernel, this API has additional semantics:

  • This API will ensure that a launch is atomic, i.e. if the API call succeeds, then the provided number of thread blocks will launch on all specified devices.

  • The functions launched via this API must be identical. No explicit checks are done by the driver in this regard because it is largely not feasible. It is up to the application to ensure this.

  • No two entries in the provided cudaLaunchParams may map to the same device.

  • All devices being targeted by this launch must be of the same compute capability - major and minor versions.

  • The block size, grid size and amount of shared memory per grid must be the same across all devices. Note that this means the maximum number of blocks that can be launched per device will be limited by the device with the least number of SMs.

  • Any user defined __device__, __constant__ or __managed__ device global variables present in the module that owns the CUfunction being launched are independently instantiated on every device. The user is responsible for initializing such device global variables appropriately.

Deprecation Notice: cudaLaunchCooperativeKernelMultiDevice has been deprecated in CUDA 11.3 for all devices. Example of an alternative approach can be found in the multi device conjugate gradient sample.

Optimal performance in multi-device synchronization is achieved by enabling peer access via cuCtxEnablePeerAccess or cudaDeviceEnablePeerAccess for all participating devices.

The launch parameters should be defined using an array of structs (one per device) and the kernels launched with cudaLaunchCooperativeKernelMultiDevice.

Example:

int numGpus = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceCount(&numGpus);

// Per device launch parameters
cudaLaunchParams *launchParams = (cudaLaunchParams*)malloc(sizeof(cudaLaunchParams) * numGpus);
cudaStream_t *streams = (cudaStream_t*)malloc(sizeof(cudaStream_t) * numGpus);

// The kernel arguments are copied over during launch.
// It's also possible to have individual copies of kernel arguments per device, but
// the signature and name of the function/kernel must be the same.
void *kernelArgs[] = { /* Add kernel arguments */ };

for (int i = 0; i < numGpus; i++) {
    cudaSetDevice(i);
    // Per device stream, but it's also possible to use the default NULL stream of each device
    cudaStreamCreate(&streams[i]);
    // Loop over other devices and cudaDeviceEnablePeerAccess to get a faster barrier implementation
}
// Since all devices must be of the same compute capability and have the same launch configuration
// it is sufficient to query device 0 here
cudaGetDeviceProperties(&deviceProp, 0);
int numThreads = 128;  // threads per block
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount, 1, 1);
for (int i = 0; i < numGpus; i++) {
    launchParams[i].func = (void*)my_kernel;
    launchParams[i].gridDim = dimGrid;
    launchParams[i].blockDim = dimBlock;
    launchParams[i].sharedMem = 0;
    launchParams[i].stream = streams[i];
    launchParams[i].args = kernelArgs;
}
cudaLaunchCooperativeKernelMultiDevice(launchParams, numGpus);

Also, as with grid-wide synchronization, the resulting device code looks very similar:

multi_grid_group multi_grid = this_multi_grid();
multi_grid.sync();

However, the code needs to be compiled with relocatable device code enabled, by passing -rdc=true to nvcc.

It is good practice to first ensure the device supports multi-device cooperative launches by querying the device attribute cudaDevAttrCooperativeMultiDeviceLaunch:

int dev = 0;
int supportsMdCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsMdCoopLaunch, cudaDevAttrCooperativeMultiDeviceLaunch, dev);

which will set supportsMdCoopLaunch to 1 if the property is supported on device 0. Only devices with compute capability of 6.0 and higher are supported. In addition, you need to be running on the Linux platform (without MPS) or on current versions of Windows with the device in TCC mode.

See the cudaLaunchCooperativeKernelMultiDevice API documentation for more information.

9. CUDA Dynamic Parallelism

9.1. Introduction

9.1.1. Overview

Dynamic Parallelism is an extension to the CUDA programming model enabling a CUDA kernel to create and synchronize with new work directly on the GPU. The creation of parallelism dynamically at whichever point in a program that it is needed offers exciting capabilities.

The ability to create work directly from the GPU can reduce the need to transfer execution control and data between host and device, as launch configuration decisions can now be made at runtime by threads executing on the device. Additionally, data-dependent parallel work can be generated inline within a kernel at run-time, taking advantage of the GPU’s hardware schedulers and load balancers dynamically and adapting in response to data-driven decisions or workloads. Algorithms and programming patterns that had previously required modifications to eliminate recursion, irregular loop structure, or other constructs that do not fit a flat, single-level of parallelism may more transparently be expressed.

This document describes the extended capabilities of CUDA which enable Dynamic Parallelism, including the modifications and additions to the CUDA programming model necessary to take advantage of these, as well as guidelines and best practices for exploiting this added capacity.

Dynamic Parallelism is only supported by devices of compute capability 3.5 and higher.

9.1.2. Glossary

Definitions for terms used in this guide.

Grid

A Grid is a collection of Threads. Threads in a Grid execute a Kernel Function and are divided into Thread Blocks.

Thread Block

A Thread Block is a group of threads which execute on the same multiprocessor (SM). Threads within a Thread Block have access to shared memory and can be explicitly synchronized.

Kernel Function

A Kernel Function is an implicitly parallel subroutine that executes under the CUDA execution and memory model for every Thread in a Grid.

Host

The Host refers to the execution environment that initially invoked CUDA. Typically the thread running on a system’s CPU processor.

Parent

A Parent Thread, Thread Block, or Grid is one that has launched new grid(s), the Child Grid(s). The Parent is not considered completed until all of its launched Child Grids have also completed.

Child

A Child thread, block, or grid is one that has been launched by a Parent grid. A Child grid must complete before the Parent Thread, Thread Block, or Grid are considered complete.

Thread Block Scope

Objects with Thread Block Scope have the lifetime of a single Thread Block. They only have defined behavior when operated on by Threads in the Thread Block that created the object and are destroyed when the Thread Block that created them is complete.

Device Runtime

The Device Runtime refers to the runtime system and APIs available to enable Kernel Functions to use Dynamic Parallelism.

9.2. Execution Environment and Memory Model

9.2.1. Execution Environment

The CUDA execution model is based on primitives of threads, thread blocks, and grids, with kernel functions defining the program executed by individual threads within a thread block and grid. When a kernel function is invoked the grid’s properties are described by an execution configuration, which has a special syntax in CUDA. Support for dynamic parallelism in CUDA extends the ability to configure, launch, and implicitly synchronize upon new grids to threads that are running on the device.

9.2.1.1. Parent and Child Grids

A device thread that configures and launches a new grid belongs to the parent grid, and the grid created by the invocation is a child grid.

The invocation and completion of child grids is properly nested, meaning that the parent grid is not considered complete until all child grids created by its threads have completed, and the runtime guarantees an implicit synchronization between the parent and child.


Figure 26 Parent-Child Launch Nesting

9.2.1.2. Scope of CUDA Primitives

On both host and device, the CUDA runtime offers an API for launching kernels and for tracking dependencies between launches via streams and events. On the host system, the state of launches and the CUDA primitives referencing streams and events are shared by all threads within a process; however processes execute independently and may not share CUDA objects.

On the device, launched kernels and CUDA objects are visible to all threads in a grid. This means, for example, that a stream may be created by one thread and used by any other thread in the grid.

9.2.1.3. Synchronization

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6 and removed for compute_90+ compilation. For compute capability < 9.0, compile-time opt-in by specifying -DCUDA_FORCE_CDP1_IF_SUPPORTED is required to continue using cudaDeviceSynchronize() in device code. Note that this is slated for full removal in a future CUDA release.

CUDA runtime operations from any thread, including kernel launches, are visible across all the threads in a grid. This means that an invoking thread in the parent grid may perform synchronization to control the launch order of grids launched by any thread in the grid on streams created by any thread in the grid. Execution of a grid is not considered complete until all launches by all threads in the grid have completed. If all threads in a grid exit before all child launches have completed, an implicit synchronization operation will automatically be triggered.

9.2.1.4. Streams and Events

CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams. Streams and events created on the device serve this exact same purpose.

Streams and events created within a grid exist within grid scope, but have undefined behavior when used outside of the grid where they were created. As described above, all work launched by a grid is implicitly synchronized when the grid exits; work launched into streams is included in this, with all dependencies resolved appropriately. The behavior of operations on a stream that has been modified outside of grid scope is undefined.

Streams and events created on the host have undefined behavior when used within any kernel, just as streams and events created by a parent grid have undefined behavior if used within a child grid.

9.2.1.5. Ordering and Concurrency

The ordering of kernel launches from the device runtime follows CUDA Stream ordering semantics. Within a grid, all kernel launches into the same stream (with the exception of the fire-and-forget stream discussed later) are executed in-order. With multiple threads in the same grid launching into the same stream, the ordering within the stream is dependent on the thread scheduling within the grid, which may be controlled with synchronization primitives such as __syncthreads().

Note that while named streams are shared by all threads within a grid, the implicit NULL stream is only shared by all threads within a thread block. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If multiple threads in different thread blocks launch into the implicit stream, then these launches may be executed concurrently. If concurrency is desired for launches by multiple threads within a thread block, explicit named streams should be used.

Dynamic Parallelism enables concurrency to be expressed more easily within a program; however, the device runtime introduces no new concurrency guarantees within the CUDA execution model. There is no guarantee of concurrent execution between any number of different thread blocks on a device.

The lack of concurrency guarantee extends to a parent grid and their child grids. When a parent grid launches a child grid, the child may start to execute once stream dependencies are satisfied and hardware resources are available to host the child, but is not guaranteed to begin execution until the parent grid reaches an implicit synchronization point.

While concurrency will often easily be achieved, it may vary as a function of device configuration, application workload, and runtime scheduling. It is therefore unsafe to depend upon any concurrency between different thread blocks.

9.2.1.6. Device Management

There is no multi-GPU support from the device runtime; the device runtime is only capable of operating on the device upon which it is currently executing. It is permitted, however, to query properties for any CUDA capable device in the system.

9.2.2. Memory Model

Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.

9.2.2.1. Coherence and Consistency

9.2.2.1.1. Global Memory

Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There is only one point of time in the execution of a child grid when its view of memory is fully consistent with the parent thread: at the point when the child grid is invoked by the parent.

All global memory operations in the parent thread prior to the child grid’s invocation are visible to the child grid. With the removal of cudaDeviceSynchronize(), it is no longer possible to access the modifications made by the threads in the child grid from the parent grid. The only way to access the modifications made by the threads in the child grid before the parent grid exits is via a kernel launched into the cudaStreamTailLaunch stream.

In the following example, the child grid executing child_launch is only guaranteed to see the modifications to data made before the child grid was launched. Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, …, data[255]=255 (without the __syncthreads() call, only data[0]=0 would be guaranteed to be seen by the child). The child grid is only guaranteed to return at an implicit synchronization. This means that the modifications made by the threads in the child grid are never guaranteed to become available to the parent grid. To access modifications made by child_launch, a tail_launch kernel is launched into the cudaStreamTailLaunch stream.

__global__ void tail_launch(int *data) {
   data[threadIdx.x] = data[threadIdx.x]+1;
}

__global__ void child_launch(int *data) {
   data[threadIdx.x] = data[threadIdx.x]+1;
}

__global__ void parent_launch(int *data) {
   data[threadIdx.x] = threadIdx.x;

   __syncthreads();

   if (threadIdx.x == 0) {
       child_launch<<< 1, 256 >>>(data);
       tail_launch<<< 1, 256, 0, cudaStreamTailLaunch >>>(data);
   }
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}
9.2.2.1.2. Zero Copy Memory

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy memory passed in from the host program.
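As a sketch of how such a pointer reaches device code (kernel names here are illustrative, not from the guide): the host allocates mapped, zero-copy memory, obtains the corresponding device pointer, and passes it to a kernel, which may in turn pass it to a child launch.

```cuda
__global__ void child(int *zc) { zc[threadIdx.x] += 1; }

__global__ void parent(int *zc) {
    zc[threadIdx.x] = threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        child<<< 1, 256 >>>(zc);   // passing the zero-copy pointer down is legal
}

void host_launch(void) {
    int *h_ptr, *d_ptr;
    // Allocation and deallocation happen only on the host.
    cudaHostAlloc(&h_ptr, 256 * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_ptr, h_ptr, 0);
    parent<<< 1, 256 >>>(d_ptr);
    cudaDeviceSynchronize();
    cudaFreeHost(h_ptr);
}
```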

9.2.2.1.3. Constant Memory

Constants may not be modified from the device; they may only be modified from the host. The behavior of modifying a constant from the host while a concurrently running grid accesses that constant at any point during its lifetime is undefined.

9.2.2.1.4. Shared and Local Memory

Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in one of these locations is referenced outside of the scope within which it belongs, and may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.
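A minimal sketch of the runtime check (kernel names are illustrative): the parent only performs the child launch if __isGlobal() confirms the pointer references global memory.

```cuda
__global__ void child(int *data) { data[threadIdx.x] += 1; }

__global__ void parent(int *data) {
    __shared__ int smem[256];
    smem[threadIdx.x] = data[threadIdx.x];
    __syncthreads();

    int *p = data;       // a global pointer: __isGlobal(p) returns nonzero
    // int *p = smem;    // a shared pointer: __isGlobal(p) would return 0
    if (threadIdx.x == 0 && __isGlobal(p))
        child<<< 1, 256 >>>(p);   // safe: p is known to be global memory
}
```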

Note that calls to cudaMemcpy*Async() or cudaMemset*Async() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.

9.2.2.1.5. Local Memory

Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.

For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:

int x_array[10];       // Creates x_array in parent's local memory
child_launch<<< 1, 1 >>>(x_array);

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope. For example:

// Correct - "value" is global storage
__device__ int value;
__device__ void x() {
    value = 5;
    child<<< 1, 1 >>>(&value);
}
// Invalid - "value" is local storage
__device__ void y() {
    int value = 5;
    child<<< 1, 1 >>>(&value);
}
9.2.2.1.6. Texture Memory

Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly to Global Memory above, writes to memory by a child are never guaranteed to be reflected in the texture memory accesses by a parent. The only way to access the modifications made by the threads in the child grid before the parent grid exits is via a kernel launched into the cudaStreamTailLaunch stream. Concurrent accesses by parent and child may result in inconsistent data.

9.3. Programming Interface

9.3.1. CUDA C++ Reference

This section describes changes and additions to the CUDA C++ language extensions for supporting Dynamic Parallelism.

The language interface and API available to CUDA kernels using CUDA C++ for Dynamic Parallelism, referred to as the Device Runtime, is substantially like that of the CUDA Runtime API available on the host. Where possible the syntax and semantics of the CUDA Runtime API have been retained in order to facilitate ease of code reuse for routines that may run in either the host or device environments.

As with all code in CUDA C++, the APIs and code outlined here is per-thread code. This enables each thread to make unique, dynamic decisions regarding what kernel or operation to execute next. There are no synchronization requirements between threads within a block to execute any of the provided device runtime APIs, which enables the device runtime API functions to be called in arbitrarily divergent kernel code without deadlock.

9.3.1.1. Device-Side Kernel Launch

Kernels may be launched from the device using the standard CUDA <<< >>> syntax:

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);
  • Dg is of type dim3 and specifies the dimensions and size of the grid

  • Db is of type dim3 and specifies the dimensions and size of each thread block

  • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call in addition to statically allocated memory. Ns is an optional argument that defaults to 0.

  • S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same grid where the call is being made. S is an optional argument that defaults to the NULL stream.
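The four configuration arguments can be exercised together from device code; the sketch below (kernel names are illustrative) creates a stream in the launching grid and sizes the child's dynamic shared memory via Ns.

```cuda
__global__ void child(int *data) {
    extern __shared__ int smem[];      // sized by the Ns launch argument
    smem[threadIdx.x] = data[threadIdx.x];
}

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        dim3 Dg(2, 1, 1);               // grid: 2 blocks
        dim3 Db(128, 1, 1);             // block: 128 threads
        size_t Ns = 128 * sizeof(int);  // dynamic shared memory per block
        child<<< Dg, Db, Ns, s >>>(data);
        cudaStreamDestroy(s);           // resources released once the work drains
    }
}
```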

9.3.1.1.1. Launches are Asynchronous

Identical to host-side launches, all device-side kernel launches are asynchronous with respect to the launching thread. That is to say, the <<<>>> launch command will return immediately and the launching thread will continue to execute until it hits an implicit launch-synchronization point (such as at a kernel launched into the cudaStreamTailLaunch stream).

The child grid launch is posted to the device and will execute independently of the parent thread. The child grid may begin execution at any time after launch, but is not guaranteed to begin execution until the launching thread reaches an implicit launch-synchronization point.

9.3.1.1.2. Launch Environment Configuration

All global device configuration settings (for example, shared memory and L1 cache size as returned from cudaDeviceGetCacheConfig(), and device limits returned from cudaDeviceGetLimit()) will be inherited from the parent. Likewise, device limits such as stack size will remain as-configured.

For host-launched kernels, per-kernel configurations set from the host will take precedence over the global setting. These configurations will be used when the kernel is launched from the device as well. It is not possible to reconfigure a kernel’s environment from the device.

9.3.1.2. Streams

Both named and unnamed (NULL) streams are available from the device runtime. Named streams may be used by any thread within a grid, but stream handles may not be passed to other child/parent kernels. In other words, a stream should be treated as private to the grid in which it is created.

Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that depend upon concurrency between child kernels are not supported by the CUDA programming model and will have undefined behavior.

The host-side NULL stream’s cross-stream barrier semantic is not supported on the device (see below for details). In order to retain semantic compatibility with the host runtime, all device streams must be created using the cudaStreamCreateWithFlags() API, passing the cudaStreamNonBlocking flag. The cudaStreamCreate() call is a host-runtime- only API and will fail to compile for the device.

As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, a kernel launched into the cudaStreamTailLaunch stream should be used instead when the application needs to know that stream-launched child kernels have completed.
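The pattern can be sketched as follows (kernel names are illustrative): work goes into a named stream, and a tail-launched kernel acts as the point at which all of this grid's child work is known to have completed.

```cuda
__global__ void worker(int *data) { data[threadIdx.x] += 1; }

// Runs only after "parent" and all its stream-launched children finish.
__global__ void after_children(int *data) { data[0] *= 2; }

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        worker<<< 1, 128, 0, s >>>(data);
        cudaStreamDestroy(s);
        // Stands in for the unsupported cudaStreamSynchronize():
        after_children<<< 1, 1, 0, cudaStreamTailLaunch >>>(data);
    }
}
```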

9.3.1.2.1. The Implicit (NULL) Stream

Within a host program, the unnamed (NULL) stream has additional barrier synchronization semantics with other streams (see Default Stream for details). The device runtime offers a single implicit, unnamed stream shared between all threads in a thread block, but as all named streams must be created with the cudaStreamNonBlocking flag, work launched into the NULL stream will not insert an implicit dependency on pending work in any other streams (including NULL streams of other thread blocks).

9.3.1.2.2. The Fire-and-Forget Stream

The fire-and-forget named stream (cudaStreamFireAndForget) allows the user to launch fire-and-forget work with less boilerplate and without stream tracking overhead. It is functionally identical to, but faster than, creating a new stream per launch, and launching into that stream.

Fire-and-forget launches are immediately scheduled for launch without any dependency on the completion of previously launched grids. No other grid launch can depend on the completion of a fire-and-forget launch, except through the implicit synchronization at the end of the parent grid. So a tail launch, or the next grid in the parent grid’s stream, will not launch before the parent grid’s fire-and-forget work has completed.

// In this example, C2's launch will not wait for C1's completion
__global__ void P( ... ) {
   C1<<< ... , cudaStreamFireAndForget >>>( ... );
   C2<<< ... , cudaStreamFireAndForget >>>( ... );
}

The fire-and-forget stream cannot be used to record or wait on events. Attempting to do so results in cudaErrorInvalidValue. The fire-and-forget stream is not supported when compiled with CUDA_FORCE_CDP1_IF_SUPPORTED defined. Fire-and-forget stream usage requires compilation to be in 64-bit mode.

9.3.1.2.3. The Tail Launch Stream

The tail launch named stream (cudaStreamTailLaunch) allows a grid to schedule a new grid for launch after its completion. In most cases it should be possible to use a tail launch to achieve the same functionality as cudaDeviceSynchronize().

Each grid has its own tail launch stream. All non-tail-launch work launched by a grid is implicitly synchronized before the tail stream is kicked off. That is, a parent grid’s tail launch does not launch until the parent grid, and all work launched by the parent grid into ordinary, per-thread, or fire-and-forget streams, has completed. If two grids are launched into the same grid’s tail launch stream, the later grid does not launch until the earlier grid and all its descendant work has completed.

// In this example, C2 will only launch after C1 completes.
__global__ void P( ... ) {
   C1<<< ... , cudaStreamTailLaunch >>>( ... );
   C2<<< ... , cudaStreamTailLaunch >>>( ... );
}

Grids launched into the tail launch stream will not launch until the completion of all work by the parent grid, including all other grids (and their descendants) launched by the parent in all non-tail launched streams, including work executed or launched after the tail launch.

// In this example, C will only launch after all X, F and P complete.
__global__ void P( ... ) {
   C<<< ... , cudaStreamTailLaunch >>>( ... );
   X<<< ... , cudaStreamPerThread >>>( ... );
   F<<< ... , cudaStreamFireAndForget >>>( ... );
}

The next grid in the parent grid’s stream will not be launched before a parent grid’s tail launch work has completed. In other words, the tail launch stream behaves as if it were inserted between its parent grid and the next grid in its parent grid’s stream.

// In this example, P2 will only launch after C completes.
__global__ void P1( ... ) {
   C<<< ... , cudaStreamTailLaunch >>>( ... );
}

__global__ void P2( ... ) {
}

int main ( ... ) {
   ...
   P1<<< ... >>>( ... );
   P2<<< ... >>>( ... );
   ...
}

Each grid gets only one tail launch stream. To tail launch grids that run concurrently, launch a single tail kernel that itself performs fire-and-forget launches, as in the example below.

// In this example,  C1 and C2 will launch concurrently after P's completion
__global__ void T( ... ) {
   C1<<< ... , cudaStreamFireAndForget >>>( ... );
   C2<<< ... , cudaStreamFireAndForget >>>( ... );
}

__global__ void P( ... ) {
   ...
   T<<< ... , cudaStreamTailLaunch >>>( ... );
}

The tail launch stream cannot be used to record or wait on events. Attempting to do so results in cudaErrorInvalidValue. The tail launch stream is not supported when compiled with CUDA_FORCE_CDP1_IF_SUPPORTED defined. Tail launch stream usage requires compilation to be in 64-bit mode.

9.3.1.3. Events

Only the inter-stream synchronization capabilities of CUDA events are supported. This means that cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), cudaEventElapsedTime(), and cudaEventQuery() are not. As cudaEventElapsedTime() is not supported, cudaEvents must be created via cudaEventCreateWithFlags(), passing the cudaEventDisableTiming flag.

As with named streams, event objects may be shared between all threads within the grid which created them but are local to that grid and may not be passed to other kernels. Event handles are not guaranteed to be unique between grids, so using an event handle within a grid that did not create it will result in undefined behavior.
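The supported inter-stream pattern can be sketched as follows (kernel names are illustrative): an event created with cudaEventDisableTiming orders work in one device stream after work in another, all within the grid that created them.

```cuda
__global__ void producer(int *data) { data[threadIdx.x] = threadIdx.x; }
__global__ void consumer(int *data) { data[threadIdx.x] += 1; }

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s1, s2;
        cudaEvent_t  e;
        cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
        cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
        cudaEventCreateWithFlags(&e, cudaEventDisableTiming);  // only supported mode

        producer<<< 1, 128, 0, s1 >>>(data);
        cudaEventRecord(e, s1);
        cudaStreamWaitEvent(s2, e, 0);   // s2's work waits for the producer
        consumer<<< 1, 128, 0, s2 >>>(data);

        cudaEventDestroy(e);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }
}
```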

9.3.1.4. Synchronization

It is up to the program to perform sufficient inter-thread synchronization, for example via a CUDA Event, if the calling thread is intended to synchronize with child grids invoked from other threads.

As it is not possible to explicitly synchronize child work from a parent thread, there is no way to guarantee that changes occurring in child grids are visible to threads within the parent grid.

9.3.1.5. Device Management

Only the device on which a kernel is running will be controllable from that kernel. This means that device APIs such as cudaSetDevice() are not supported by the device runtime. The active device as seen from the GPU (returned from cudaGetDevice()) will have the same device number as seen from the host system. The cudaDeviceGetAttribute() call may request information about another device as this API allows specification of a device ID as a parameter of the call. Note that the catch-all cudaGetDeviceProperties() API is not offered by the device runtime - properties must be queried individually.
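A small sketch of what is queryable from device code (the attribute chosen here is one example; any cudaDevAttr* value may be requested):

```cuda
#include <cstdio>   // device-side printf

__global__ void query(void) {
    if (threadIdx.x == 0) {
        int dev, count, smCount;
        cudaGetDevice(&dev);          // same device ID as seen from the host
        cudaGetDeviceCount(&count);   // may inspect other devices' attributes
        cudaDeviceGetAttribute(&smCount,
                               cudaDevAttrMultiProcessorCount, dev);
        printf("device %d of %d: %d SMs\n", dev, count, smCount);
    }
}
```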

9.3.1.6. Memory Declarations

9.3.1.6.1. Device and Constant Memory

Memory declared at file scope with __device__ or __constant__ memory space specifiers behaves identically when using the device runtime. All kernels may read or write device variables, whether the kernel was initially launched by the host or device runtime. Equivalently, all kernels will have the same view of __constant__s as declared at the module scope.

9.3.1.6.2. Textures and Surfaces

CUDA supports dynamically created texture and surface objects, where a texture object may be created on the host, passed to a kernel, used by that kernel, and then destroyed from the host. The device runtime does not allow creation or destruction of texture or surface objects from within device code, but texture and surface objects created from the host may be used and passed around freely on the device. Regardless of where they are created, dynamically created texture objects are always valid and may be passed to child kernels from a parent.

Note

The device runtime does not support legacy module-scope (i.e., Fermi-style) textures and surfaces within a kernel launched from the device. Module-scope (legacy) textures may be created from the host and used in device code as for any kernel, but may only be used by a top-level kernel (i.e., the one which is launched from the host).

9.3.1.6.3. Shared Memory Variable Declarations

In CUDA C++ shared memory can be declared either as a statically sized file-scope or function-scoped variable, or as an extern variable with the size determined at runtime by the kernel’s caller via a launch configuration argument. Both types of declarations are valid under the device runtime.

__global__ void permute(int n, int *data) {
   extern __shared__ int smem[];
   if (n <= 1)
       return;

   smem[threadIdx.x] = data[threadIdx.x];
   __syncthreads();

   permute_data(smem, n);
   __syncthreads();

   // Write back to GMEM since we can't pass SMEM to children.
   data[threadIdx.x] = smem[threadIdx.x];
   __syncthreads();

   if (threadIdx.x == 0) {
       permute<<< 1, 256, n/2*sizeof(int) >>>(n/2, data);
       permute<<< 1, 256, n/2*sizeof(int) >>>(n/2, data+n/2);
   }
}

void host_launch(int *data) {
    permute<<< 1, 256, 256*sizeof(int) >>>(256, data);
}
9.3.1.6.4. Symbol Addresses

Device-side symbols (i.e., those marked __device__) may be referenced from within a kernel simply via the & operator, as all global-scope device variables are in the kernel’s visible address space. This also applies to __constant__ symbols, although in this case the pointer will reference read-only data.

Given that device-side symbols can be referenced directly, those CUDA runtime APIs which reference symbols (e.g., cudaMemcpyToSymbol() or cudaGetSymbolAddress()) are redundant and hence not supported by the device runtime. Note this implies that constant data cannot be altered from within a running kernel, even ahead of a child kernel launch, as references to __constant__ space are read-only.
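For example (kernel and variable names are illustrative), a module-scope __device__ symbol can be passed to a child kernel directly via &, with no need for cudaGetSymbolAddress():

```cuda
__device__ int counter;   // module-scope device symbol

__global__ void child(int *p) { atomicAdd(p, 1); }

__global__ void parent(void) {
    if (threadIdx.x == 0)
        child<<< 1, 32 >>>(&counter);   // &counter is a valid global pointer
}
```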

9.3.1.7. API Errors and Launch Failures

As usual for the CUDA runtime, any function may return an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t.

Similar to a host-side launch, device-side launches may fail for many reasons (invalid arguments, etc). The user must call cudaGetLastError() to determine if a launch generated an error, however lack of an error after launch does not imply the child kernel completed successfully.
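The check can be sketched as follows (kernel name is illustrative); note that cudaSuccess here means only that the launch was accepted, not that the child completed successfully.

```cuda
#include <cstdio>   // device-side printf

__global__ void child(int *data) { data[threadIdx.x] = threadIdx.x; }

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        child<<< 1, 1024 >>>(data);   // may fail, e.g. with invalid configuration
        cudaError_t err = cudaGetLastError();   // per-thread error state
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));
    }
}
```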

For device-side exceptions, e.g., access to an invalid address, an error in a child grid will be returned to the host.

9.3.1.7.1. Launch Setup APIs

Kernel launch is a system-level mechanism exposed through the device runtime library, and as such is available directly from PTX via the underlying cudaGetParameterBuffer() and cudaLaunchDevice() APIs. It is permitted for a CUDA application to call these APIs itself, with the same requirements as for PTX. In both cases, the user is then responsible for correctly populating all necessary data structures in the correct format according to specification. Backwards compatibility is guaranteed in these data structures.

As with host-side launch, the device-side operator <<<>>> maps to underlying kernel launch APIs. This is so that users targeting PTX will be able to enact a launch, and so that the compiler front-end can translate <<<>>> into these calls.

Table 9 New Device-only Launch Implementation Functions

Runtime API Launch Functions

Description of Difference From Host Runtime Behaviour (behavior is identical if no description)

cudaGetParameterBuffer

Generated automatically from <<<>>>. Note different API to host equivalent.

cudaLaunchDevice

Generated automatically from <<<>>>. Note different API to host equivalent.

The APIs for these launch functions are different to those of the CUDA Runtime API, and are defined as follows:

extern __device__ cudaError_t cudaGetParameterBuffer(void **params);
extern __device__ cudaError_t cudaLaunchDevice(void *kernel,
                                        void *params, dim3 gridDim,
                                        dim3 blockDim,
                                        unsigned int sharedMemSize = 0,
                                        cudaStream_t stream = 0);

9.3.1.8. API Reference

The portions of the CUDA Runtime API supported in the device runtime are detailed here. Host and device runtime APIs have identical syntax; semantics are the same except where indicated. The following table provides an overview of the API relative to the version available from the host.

Table 10 Supported API Functions

Runtime API Functions

Details

cudaDeviceGetCacheConfig

cudaDeviceGetLimit

cudaGetLastError

Last error is per-thread state, not per-block state

cudaPeekAtLastError

cudaGetErrorString

cudaGetDeviceCount

cudaDeviceGetAttribute

Will return attributes for any device

cudaGetDevice

Always returns current device ID as would be seen from host

cudaStreamCreateWithFlags

Must pass cudaStreamNonBlocking flag

cudaStreamDestroy

cudaStreamWaitEvent

cudaEventCreateWithFlags

Must pass cudaEventDisableTiming flag

cudaEventRecord

cudaEventDestroy

cudaFuncGetAttributes

cudaMemcpyAsync

Notes about all memcpy/memset functions:

  • Only async memcpy/set functions are supported

  • Only device-to-device memcpy is permitted

  • May not pass in local or shared memory pointers

cudaMemcpy2DAsync

cudaMemcpy3DAsync

cudaMemsetAsync

cudaMemset2DAsync

cudaMemset3DAsync

cudaRuntimeGetVersion

cudaMalloc

May not call cudaFree on the device on a pointer created on the host, and vice-versa

cudaFree

cudaOccupancyMaxActiveBlocksPerMultiprocessor

cudaOccupancyMaxPotentialBlockSize

cudaOccupancyMaxPotentialBlockSizeVariableSMem

9.3.2. Device-side Launch from PTX

This section is for the programming language and compiler implementers who target Parallel Thread Execution (PTX) and plan to support Dynamic Parallelism in their language. It provides the low-level details related to supporting kernel launches at the PTX level.

9.3.2.1. Kernel Launch APIs

Device-side kernel launches can be implemented using the following two APIs accessible from PTX: cudaLaunchDevice() and cudaGetParameterBuffer(). cudaLaunchDevice() launches the specified kernel with the parameter buffer that is obtained by calling cudaGetParameterBuffer() and filled with the parameters to the launched kernel. The parameter buffer can be NULL, i.e., no need to invoke cudaGetParameterBuffer(), if the launched kernel does not take any parameters.

9.3.2.1.1. cudaLaunchDevice

At the PTX level, cudaLaunchDevice() needs to be declared in one of the two forms shown below before it is used.

// PTX-level Declaration of cudaLaunchDevice() when .address_size is 64
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b64 func,
  .param .b64 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b64 stream
)
;

The CUDA-level declaration below is mapped to one of the aforementioned PTX-level declarations and is found in the system header file cuda_device_runtime_api.h. The function is defined in the cudadevrt system library, which must be linked with a program in order to use device-side kernel launch functionality.

// CUDA-level declaration of cudaLaunchDevice()
extern "C" __device__
cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                             dim3 gridDimension, dim3 blockDimension,
                             unsigned int sharedMemSize,
                             cudaStream_t stream);

The first parameter is a pointer to the kernel to be launched, and the second parameter is the parameter buffer that holds the actual parameters to the launched kernel. The layout of the parameter buffer is explained in Parameter Buffer Layout, below. Other parameters specify the launch configuration: the grid dimension, block dimension, shared memory size, and the stream associated with the launch (please refer to Execution Configuration for a detailed description of the launch configuration).

9.3.2.1.2. cudaGetParameterBuffer

cudaGetParameterBuffer() needs to be declared at the PTX level before it’s used. The PTX-level declaration must be in one of the two forms given below, depending on address size:

// PTX-level Declaration of cudaGetParameterBuffer() when .address_size is 64
.extern .func(.param .b64 func_retval0) cudaGetParameterBuffer
(
  .param .b64 alignment,
  .param .b64 size
)
;

The following CUDA-level declaration of cudaGetParameterBuffer() is mapped to the aforementioned PTX-level declaration:

// CUDA-level Declaration of cudaGetParameterBuffer()
extern "C" __device__
void *cudaGetParameterBuffer(size_t alignment, size_t size);

The first parameter specifies the alignment requirement of the parameter buffer, and the second parameter the size requirement in bytes. In the current implementation, the parameter buffer returned by cudaGetParameterBuffer() is always guaranteed to be 64-byte aligned, and the alignment requirement parameter is ignored. However, it is recommended to pass the correct alignment requirement value, which is the largest alignment of any parameter to be placed in the parameter buffer, to cudaGetParameterBuffer() to ensure portability in the future.

9.3.2.2. Parameter Buffer Layout

Parameter reordering in the parameter buffer is prohibited, and each individual parameter placed in the parameter buffer is required to be aligned. That is, each parameter must be placed at the nth byte in the parameter buffer, where n is the smallest multiple of the parameter size that is greater than the offset of the last byte taken by the preceding parameter. The maximum size of the parameter buffer is 4KB.

For a more detailed description of PTX code generated by the CUDA compiler, please refer to the PTX-3.5 specification.

9.3.3. Toolkit Support for Dynamic Parallelism

9.3.3.1. Including Device Runtime API in CUDA Code

Similar to the host-side runtime API, prototypes for the CUDA device runtime API are included automatically during program compilation. There is no need to include cuda_device_runtime_api.h explicitly.

9.3.3.2. Compiling and Linking

When compiling and linking CUDA programs using dynamic parallelism with nvcc, the program will automatically link against the static device runtime library libcudadevrt.

The device runtime is offered as a static library (cudadevrt.lib on Windows, libcudadevrt.a under Linux), against which a GPU application that uses the device runtime must be linked. Linking of device libraries can be accomplished through nvcc and/or nvlink. Two simple examples are shown below.

A device runtime program may be compiled and linked in a single step, if all required source files can be specified from the command line:

$ nvcc -arch=sm_75 -rdc=true hello_world.cu -o hello -lcudadevrt

It is also possible to compile CUDA .cu source files first to object files, and then link these together in a two-stage process:

$ nvcc -arch=sm_75 -dc hello_world.cu -o hello_world.o
$ nvcc -arch=sm_75 -rdc=true hello_world.o -o hello -lcudadevrt

Please see the Using Separate Compilation section of The CUDA Driver Compiler NVCC guide for more details.

9.4. Programming Guidelines

9.4.1. Basics

The device runtime is a functional subset of the host runtime. API level device management, kernel launching, device memcpy, stream management, and event management are exposed from the device runtime.

Programming for the device runtime should be familiar to someone who already has experience with CUDA. Device runtime syntax and semantics are largely the same as that of the host API, with any exceptions detailed earlier in this document.

The following example shows a simple Hello World program incorporating dynamic parallelism:

#include <stdio.h>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void tailKernel()
{
    printf("World!\n");
}

__global__ void parentKernel()
{
    // launch child
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return;
    }

    // launch tail into cudaStreamTailLaunch stream
    // implicitly synchronizes: waits for child to complete
    tailKernel<<<1,1,0,cudaStreamTailLaunch>>>();

}

int main(int argc, char *argv[])
{
    // launch parent
    parentKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }

    // wait for parent to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return 2;
    }

    return 0;
}

This program may be built in a single step from the command line as follows:

$ nvcc -arch=sm_75 -rdc=true hello_world.cu -o hello -lcudadevrt

9.4.2. Performance

9.4.2.1. Dynamic-parallelism-enabled Kernel Overhead

System software which is active when controlling dynamic launches may impose an overhead on any kernel which is running at the time, whether or not it invokes kernel launches of its own. This overhead arises from the device runtime’s execution tracking and management software and may result in decreased performance. This overhead is, in general, incurred for applications that link against the device runtime library.

9.4.3. Implementation Restrictions and Limitations

Dynamic Parallelism guarantees all semantics described in this document; however, certain hardware and software resources are implementation-dependent and limit the scale, performance, and other properties of a program which uses the device runtime.

9.4.3.1. Runtime

9.4.3.1.1. Memory Footprint

The device runtime system software reserves memory for various management purposes, in particular a reservation for tracking pending grid launches. Configuration controls are available to reduce the size of this reservation in exchange for certain launch limitations. See Configuration Options, below, for details.

9.4.3.1.2. Pending Kernel Launches

When a kernel is launched, all associated configuration and parameter data is tracked until the kernel completes. This data is stored within a system-managed launch pool.

The size of the fixed-size launch pool is configurable by calling cudaDeviceSetLimit() from the host and specifying cudaLimitDevRuntimePendingLaunchCount.
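As a hedged host-side sketch of this configuration (the value 32768 is an arbitrary example, not a recommendation), the pool can be sized before the first kernel launch and then read back:

```cuda
// Host code: must run before any kernel is launched.
cudaError_t err =
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);

size_t count = 0;
cudaDeviceGetLimit(&count, cudaLimitDevRuntimePendingLaunchCount);  // verify
```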

9.4.3.1.3. Configuration Options

Resource allocation for the device runtime system software is controlled via the cudaDeviceSetLimit() API from the host program. Limits must be set before any kernel is launched, and may not be changed while the GPU is actively running programs.

The following named limits may be set:

Limit: cudaLimitDevRuntimePendingLaunchCount

Behavior: Controls the amount of memory set aside for buffering kernel launches and events which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, an attempt to allocate a launch slot during a device-side kernel launch will fail and return cudaErrorLaunchOutOfResources, while an attempt to allocate an event slot will fail and return cudaErrorMemoryAllocation. The default number of launch slots is 2048. Applications may increase the number of launch and/or event slots by setting cudaLimitDevRuntimePendingLaunchCount. The number of event slots allocated is twice the value of that limit.

Limit: cudaLimitStackSize

Behavior: Controls the stack size in bytes of each GPU thread. The CUDA driver automatically increases the per-thread stack size for each kernel launch as needed. This size isn't reset back to the original value after each launch. To set the per-thread stack size to a different value, cudaDeviceSetLimit() can be called to set this limit. The stack will be immediately resized, and if necessary, the device will block until all preceding requested tasks are complete. cudaDeviceGetLimit() can be called to get the current per-thread stack size.

9.4.3.1.4. Memory Allocation and Lifetime

cudaMalloc() and cudaFree() have distinct semantics between the host and device environments. When invoked from the host, cudaMalloc() allocates a new region from unused device memory. When invoked from the device runtime these functions map to device-side malloc() and free(). This implies that within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Also, it is an error to invoke cudaFree() from the host program on a pointer which was allocated by cudaMalloc() on the device or vice-versa.
cudaMalloc()cudaFree() 在主机和设备环境之间具有不同的语义。当从主机调用时, cudaMalloc() 会从未使用的设备内存中分配一个新区域。当从设备运行时调用这些函数时,这些函数会映射到设备端的 malloc()free() 。这意味着在设备环境中,可分配的总内存受限于设备 malloc() 堆大小,这可能小于可用的未使用设备内存。此外,在主机程序上调用 cudaFree() 对由设备上的 cudaMalloc() 分配的指针或反之则是错误的。

                         cudaMalloc() on Host      cudaMalloc() on Device

cudaFree() on Host       Supported                 Not Supported

cudaFree() on Device     Not Supported             Supported

Allocation limit         Free device memory        cudaLimitMallocHeapSize
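To make the device-heap relationship concrete, here is a hedged sketch (the kernel name worker and the 8 MB heap size are illustrative assumptions): the host sizes the malloc() heap via cudaLimitMallocHeapSize, and the kernel's cudaMalloc()/cudaFree() calls draw from that heap.

```cuda
__global__ void worker()
{
    // Device-side cudaMalloc() maps to malloc() on the device heap.
    int *buf = NULL;
    if (cudaMalloc(&buf, 64 * sizeof(int)) != cudaSuccess)
        return;                 // heap exhausted
    buf[threadIdx.x] = threadIdx.x;
    cudaFree(buf);              // must be freed on the device, not by the host
}

int main()
{
    // Size the device malloc() heap (8 MB here) before launching any kernel.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 << 20);
    worker<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Because device-side cudaMalloc() is part of the device runtime, this sketch would be built with -rdc=true and linked against cudadevrt, as described in Compiling and Linking above.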

9.4.3.1.5. SM Id and Warp Id

Note that in PTX %smid and %warpid are defined as volatile values. The device runtime may reschedule thread blocks onto different SMs in order to more efficiently manage resources. As such, it is unsafe to rely upon %smid or %warpid remaining unchanged across the lifetime of a thread or thread block.

9.4.3.1.6. ECC Errors

No notification of ECC errors is available to code within a CUDA kernel. ECC errors are reported at the host side once the entire launch tree has completed. Any ECC errors which arise during execution of a nested program will either generate an exception or continue execution (depending upon error and configuration).

9.5. CDP2 vs CDP1

This section summarises the differences between, and the compatibility and interoperability of, the new (CDP2) and legacy (CDP1) CUDA Dynamic Parallelism interfaces. It also shows how to opt out of the CDP2 interface on devices of compute capability less than 9.0.

9.5.1. Differences Between CDP1 and CDP2

Explicit device-side synchronization is no longer possible with CDP2 or on devices of compute capability 9.0 or higher. Implicit synchronization (such as tail launches) must be used instead.

Attempting to query or set cudaLimitDevRuntimeSyncDepth (or CU_LIMIT_DEV_RUNTIME_SYNC_DEPTH) with CDP2 or on devices of compute capability 9.0 or higher results in cudaErrorUnsupportedLimit.

CDP2 no longer has a virtualized pool for pending launches that don’t fit in the fixed-sized pool. cudaLimitDevRuntimePendingLaunchCount must be set to be large enough to avoid running out of launch slots.

For CDP2, there is a limit to the total number of events existing at once (note that events are destroyed only after a launch completes), equal to twice the pending launch count. cudaLimitDevRuntimePendingLaunchCount must be set to be large enough to avoid running out of event slots.

Streams are tracked per grid with CDP2 or on devices of compute capability 9.0 or higher, not per thread block. This allows work to be launched into a stream created by another thread block. Attempting to do so with CDP1 results in cudaErrorInvalidValue.

CDP2 introduces the tail launch (cudaStreamTailLaunch) and fire-and-forget (cudaStreamFireAndForget) named streams.

CDP2 is supported only under 64-bit compilation mode.

9.5.2. Compatibility and Interoperability

CDP2 is the default. Functions can be compiled with -DCUDA_FORCE_CDP1_IF_SUPPORTED to opt out of using CDP2 on devices of compute capability less than 9.0.

Function compiled with CUDA 12.0 and newer (default):

Compilation: Compile error if device code references cudaDeviceSynchronize.

Compute capability < 9.0: New interface is used.

Compute capability 9.0 and higher: New interface is used.

Function compiled with pre-CUDA 12.0, or with CUDA 12.0 and newer with -DCUDA_FORCE_CDP1_IF_SUPPORTED specified:

Compilation: Compile error if code references cudaStreamTailLaunch or cudaStreamFireAndForget. Compile error if device code references cudaDeviceSynchronize and code is compiled for sm_90 or newer.

Compute capability < 9.0: Legacy interface is used.

Compute capability 9.0 and higher: New interface is used. If the function references cudaDeviceSynchronize in device code, function load returns cudaErrorSymbolNotFound (this could happen if the code is compiled for devices of compute capability less than 9.0, but run on devices of compute capability 9.0 or higher using JIT).

Functions using CDP1 and CDP2 may be loaded and run simultaneously in the same context. The CDP1 functions are able to use CDP1-specific features (e.g. cudaDeviceSynchronize) and CDP2 functions are able to use CDP2-specific features (e.g. tail launch and fire-and-forget launch).

A function using CDP1 cannot launch a function using CDP2, and vice versa. If a function that would use CDP1 contains in its call graph a function that would use CDP2, or vice versa, cudaErrorCdpVersionMismatch would result during function load.

9.6. Legacy CUDA Dynamic Parallelism (CDP1)

See CUDA Dynamic Parallelism, above, for CDP2 version of document.

9.6.1. Execution Environment and Memory Model (CDP1)

See Execution Environment and Memory Model, above, for CDP2 version of document.

9.6.1.1. Execution Environment (CDP1)

See Execution Environment, above, for CDP2 version of document.

The CUDA execution model is based on primitives of threads, thread blocks, and grids, with kernel functions defining the program executed by individual threads within a thread block and grid. When a kernel function is invoked the grid’s properties are described by an execution configuration, which has a special syntax in CUDA. Support for dynamic parallelism in CUDA extends the ability to configure, launch, and synchronize upon new grids to threads that are running on the device.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

9.6.1.1.1. Parent and Child Grids (CDP1)

See Parent and Child Grids, above, for CDP2 version of document.

A device thread that configures and launches a new grid belongs to the parent grid, and the grid created by the invocation is a child grid.

The invocation and completion of child grids is properly nested, meaning that the parent grid is not considered complete until all child grids created by its threads have completed. Even if the invoking threads do not explicitly synchronize on the child grids launched, the runtime guarantees an implicit synchronization between the parent and child.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.


Figure 27 Parent-Child Launch Nesting

9.6.1.1.2. Scope of CUDA Primitives (CDP1)

See Scope of CUDA Primitives, above, for CDP2 version of document.

On both host and device, the CUDA runtime offers an API for launching kernels, for waiting for launched work to complete, and for tracking dependencies between launches via streams and events. On the host system, the state of launches and the CUDA primitives referencing streams and events are shared by all threads within a process; however processes execute independently and may not share CUDA objects.

A similar hierarchy exists on the device: launched kernels and CUDA objects are visible to all threads in a thread block, but are independent between thread blocks. This means for example that a stream may be created by one thread and used by any other thread in the same thread block, but may not be shared with threads in any other thread block.

9.6.1.1.3. Synchronization (CDP1)

See Synchronization, above, for CDP2 version of document.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

CUDA runtime operations from any thread, including kernel launches, are visible across a thread block. This means that an invoking thread in the parent grid may perform synchronization on the grids launched by that thread, by other threads in the thread block, or on streams created within the same thread block. Execution of a thread block is not considered complete until all launches by all threads in the block have completed. If all threads in a block exit before all child launches have completed, a synchronization operation will automatically be triggered.

9.6.1.1.4. Streams and Events (CDP1)

See Streams and Events, above, for CDP2 version of document.

CUDA Streams and Events allow control over dependencies between grid launches: grids launched into the same stream execute in-order, and events may be used to create dependencies between streams. Streams and events created on the device serve this exact same purpose.

Streams and events created within a grid exist within thread block scope but have undefined behavior when used outside of the thread block where they were created. As described above, all work launched by a thread block is implicitly synchronized when the block exits; work launched into streams is included in this, with all dependencies resolved appropriately. The behavior of operations on a stream that has been modified outside of thread block scope is undefined.

Streams and events created on the host have undefined behavior when used within any kernel, just as streams and events created by a parent grid have undefined behavior if used within a child grid.

9.6.1.1.5. Ordering and Concurrency (CDP1)

See Ordering and Concurrency, above, for CDP2 version of document.

The ordering of kernel launches from the device runtime follows CUDA Stream ordering semantics. Within a thread block, all kernel launches into the same stream are executed in-order. With multiple threads in the same thread block launching into the same stream, the ordering within the stream is dependent on the thread scheduling within the block, which may be controlled with synchronization primitives such as __syncthreads().

Note that because streams are shared by all threads within a thread block, the implicit NULL stream is also shared. If multiple threads in a thread block launch into the implicit stream, then these launches will be executed in-order. If concurrency is desired, explicit named streams should be used.
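As a hedged sketch of using explicit named streams for this (child is a placeholder kernel), each thread can create its own device-side stream, which must be created with the cudaStreamNonBlocking flag:

```cuda
__global__ void child(int i) { /* placeholder child work */ }

__global__ void parent()
{
    // Device-side streams must be created with cudaStreamNonBlocking.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Launches from different threads into distinct streams may run
    // concurrently, though concurrency is not guaranteed.
    child<<<1, 1, 0, s>>>(threadIdx.x);

    cudaStreamDestroy(s);  // resources are released once the stream's work completes
}
```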

Dynamic Parallelism enables concurrency to be expressed more easily within a program; however, the device runtime introduces no new concurrency guarantees within the CUDA execution model. There is no guarantee of concurrent execution between any number of different thread blocks on a device.

The lack of concurrency guarantee extends to parent thread blocks and their child grids. When a parent thread block launches a child grid, the child is not guaranteed to begin execution until the parent thread block reaches an explicit synchronization point (such as cudaDeviceSynchronize()).

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

While concurrency will often easily be achieved, it may vary as a function of device configuration, application workload, and runtime scheduling. It is therefore unsafe to depend upon any concurrency between different thread blocks.

9.6.1.1.6. Device Management (CDP1)

See Device Management, above, for CDP2 version of document.

There is no multi-GPU support from the device runtime; the device runtime is only capable of operating on the device upon which it is currently executing. It is permitted, however, to query properties for any CUDA capable device in the system.

9.6.1.2. Memory Model (CDP1)

See Memory Model, above, for CDP2 version of document.

Parent and child grids share the same global and constant memory storage, but have distinct local and shared memory.

9.6.1.2.1. Coherence and Consistency (CDP1)

See Coherence and Consistency, above, for CDP2 version of document.

9.6.1.2.1.1. Global Memory (CDP1)

See Global Memory, above, for CDP2 version of document.

Parent and child grids have coherent access to global memory, with weak consistency guarantees between child and parent. There are two points in the execution of a child grid when its view of memory is fully consistent with the parent thread: when the child grid is invoked by the parent, and when the child grid completes as signaled by a synchronization API invocation in the parent thread.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

All global memory operations in the parent thread prior to the child grid’s invocation are visible to the child grid. All memory operations of the child grid are visible to the parent after the parent has synchronized on the child grid’s completion.

In the following example, the child grid executing child_launch is only guaranteed to see the modifications to data made before the child grid was launched. Since thread 0 of the parent is performing the launch, the child will be consistent with the memory seen by thread 0 of the parent. Due to the first __syncthreads() call, the child will see data[0]=0, data[1]=1, …, data[255]=255 (without the __syncthreads() call, only data[0] would be guaranteed to be seen by the child). When the child grid returns, thread 0 is guaranteed to see modifications made by the threads in its child grid. Those modifications become available to the other threads of the parent grid only after the second __syncthreads() call:

__global__ void child_launch(int *data) {
   data[threadIdx.x] = data[threadIdx.x]+1;
}

__global__ void parent_launch(int *data) {
   data[threadIdx.x] = threadIdx.x;

   __syncthreads();

   if (threadIdx.x == 0) {
       child_launch<<< 1, 256 >>>(data);
       cudaDeviceSynchronize();
   }

   __syncthreads();
}

void host_launch(int *data) {
    parent_launch<<< 1, 256 >>>(data);
}

9.6.1.2.1.2. Zero Copy Memory (CDP1)

See Zero Copy Memory, above, for CDP2 version of document.

Zero-copy system memory has identical coherence and consistency guarantees to global memory, and follows the semantics detailed above. A kernel may not allocate or free zero-copy memory, but may use pointers to zero-copy passed in from the host program.

9.6.1.2.1.3. Constant Memory (CDP1)

See Constant Memory, above, for CDP2 version of document.

Constants are immutable and may not be modified from the device, even between parent and child launches. That is to say, the value of all __constant__ variables must be set from the host prior to launch. Constant memory is inherited automatically by all child kernels from their respective parents.

Taking the address of a constant memory object from within a kernel thread has the same semantics as for all CUDA programs, and passing that pointer from parent to child or from a child to parent is naturally supported.

9.6.1.2.1.4. Shared and Local Memory (CDP1)

See Shared and Local Memory, above, for CDP2 version of document.

Shared and Local memory is private to a thread block or thread, respectively, and is not visible or coherent between parent and child. Behavior is undefined when an object in one of these locations is referenced outside of the scope within which it belongs, and may cause an error.

The NVIDIA compiler will attempt to warn if it can detect that a pointer to local or shared memory is being passed as an argument to a kernel launch. At runtime, the programmer may use the __isGlobal() intrinsic to determine whether a pointer references global memory and so may safely be passed to a child launch.
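As a hedged sketch of this runtime check (parent and child are illustrative kernels, and a 32-thread block is assumed), a launch can be guarded with __isGlobal() so that only pointers to global memory are forwarded to the child:

```cuda
__global__ void child(int *data);

__global__ void parent(int *gdata)
{
    __shared__ int tile[32];
    tile[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();

    // gdata references global memory, so it may be passed to a child launch;
    // passing `tile` (shared memory) instead would be undefined behavior.
    if (threadIdx.x == 0 && __isGlobal(gdata)) {
        child<<<1, 32>>>(gdata);
    }
}
```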

Note that calls to cudaMemcpy*Async() or cudaMemset*Async() may invoke new child kernels on the device in order to preserve stream semantics. As such, passing shared or local memory pointers to these APIs is illegal and will return an error.
请注意,调用 cudaMemcpy*Async()cudaMemset*Async() 可能会在设备上调用新的子内核,以保留流语义。因此,将共享或本地内存指针传递给这些 API 是非法的,并将返回错误。
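As a sketch of the workaround (kernel and buffer names are hypothetical, the buffer size is assumed), shared-memory data can be staged through global memory before being handed to one of these APIs:

```cuda
// Hypothetical sketch: cudaMemcpy*Async() may not take shared or local
// pointers, so stage the data through a global buffer first.
__device__ int g_stage[256];                   // global staging buffer (assumed size)

__global__ void parent(int *dst) {
    __shared__ int smem[256];
    smem[threadIdx.x] = threadIdx.x;
    __syncthreads();

    // Illegal: cudaMemcpyAsync(dst, smem, ...) -- smem is shared memory.
    g_stage[threadIdx.x] = smem[threadIdx.x];  // legal: copy out via global memory
    __syncthreads();

    if (threadIdx.x == 0)
        cudaMemcpyAsync(dst, g_stage, 256 * sizeof(int),
                        cudaMemcpyDeviceToDevice);
}
```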

9.6.1.2.1.5. Local Memory (CDP1)
9.6.1.2.1.5. 本地内存(CDP1) 

See Local Memory, above, for CDP2 version of document.
请参阅上面的本地内存,以获取文档的 CDP2 版本。

Local memory is private storage for an executing thread, and is not visible outside of that thread. It is illegal to pass a pointer to local memory as a launch argument when launching a child kernel. The result of dereferencing such a local memory pointer from a child will be undefined.
本地内存是执行线程的私有存储,不会对线程外部可见。在启动子内核时,将本地内存指针作为启动参数传递是非法的。从子内核对此类本地内存指针进行解引用的结果将是未定义的。

For example the following is illegal, with undefined behavior if x_array is accessed by child_launch:
例如,如果通过 child_launch 访问 x_array ,则以下操作是非法的,具有未定义的行为:

int x_array[10];       // Creates x_array in parent's local memory
child_launch<<< 1, 1 >>>(x_array);

It is sometimes difficult for a programmer to be aware of when a variable is placed into local memory by the compiler. As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap, either with cudaMalloc(), new() or by declaring __device__ storage at global scope. For example:
有时,程序员很难意识到编译器何时将变量放入本地内存。一般原则是,传递给子内核的所有存储都应明确从全局内存堆中分配,可以使用 cudaMalloc()new() 或通过在全局范围声明 __device__ 存储。例如:

// Correct - "value" is global storage
__device__ int value;
__device__ void x() {
    value = 5;
    child<<< 1, 1 >>>(&value);
}
// Invalid - "value" is local storage
__device__ void y() {
    int value = 5;
    child<<< 1, 1 >>>(&value);
}
9.6.1.2.1.6. Texture Memory (CDP1)
9.6.1.2.1.6. 纹理内存(CDP1) 

See Texture Memory, above, for CDP2 version of document.
请参阅上文的“纹理内存”部分,以获取文档的 CDP2 版本。

Writes to the global memory region over which a texture is mapped are incoherent with respect to texture accesses. Coherence for texture memory is enforced at the invocation of a child grid and when a child grid completes. This means that writes to memory prior to a child kernel launch are reflected in texture memory accesses of the child. Similarly, writes to memory by a child will be reflected in the texture memory accesses by a parent, but only after the parent synchronizes on the child’s completion. Concurrent accesses by parent and child may result in inconsistent data.
对于映射纹理的全局内存区域的写入与纹理访问不一致。纹理内存的一致性在调用子网格时强制执行,并在子网格完成时强制执行。这意味着在启动子内核之前对内存的写入会反映在子级的纹理内存访问中。同样,子级对内存的写入将反映在父级的纹理内存访问中,但只有在父级在子级完成后同步时才会发生。父级和子级的并发访问可能导致数据不一致。

Warning 警告

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.
在 CUDA 11.6 中,从父块(即在设备代码中使用 cudaDeviceSynchronize() )显式与子内核同步已被弃用,对于 compute_90+编译已被移除,并计划在未来的 CUDA 版本中完全移除。

9.6.2. Programming Interface (CDP1)
9.6.2. 编程接口(CDP1) 

See Programming Interface, above, for CDP2 version of document.
请参阅上面的编程接口,获取文档的 CDP2 版本。

9.6.2.1. CUDA C++ Reference (CDP1)
9.6.2.1. CUDA C++ 参考(CDP1) 

See CUDA C++ Reference, above, for CDP2 version of document.
请参阅上面的 CUDA C++ 参考文档,获取文档的 CDP2 版本。

This section describes changes and additions to the CUDA C++ language extensions for supporting Dynamic Parallelism.
本节描述了支持动态并行性的 CUDA C++语言扩展的更改和添加。

The language interface and API available to CUDA kernels using CUDA C++ for Dynamic Parallelism, referred to as the Device Runtime, is substantially like that of the CUDA Runtime API available on the host. Where possible the syntax and semantics of the CUDA Runtime API have been retained in order to facilitate ease of code reuse for routines that may run in either the host or device environments.
使用 CUDA C++ 进行动态并行性的 CUDA 内核可用的语言界面和 API(称为设备运行时)与主机上可用的 CUDA 运行时 API 非常相似。在可能的情况下,已保留了 CUDA 运行时 API 的语法和语义,以便为可能在主机或设备环境中运行的例程提供代码重用的便利。

As with all code in CUDA C++, the APIs and code outlined here are per-thread code. This enables each thread to make unique, dynamic decisions regarding what kernel or operation to execute next. There are no synchronization requirements between threads within a block to execute any of the provided device runtime APIs, which enables the device runtime API functions to be called in arbitrarily divergent kernel code without deadlock.
与 CUDA C++中的所有代码一样,这里概述的 API 和代码是每个线程的代码。这使得每个线程能够就执行下一个内核或操作做出独特的动态决策。在块内部的线程之间没有同步要求来执行任何提供的设备运行时 API,这使得可以在任意分歧的内核代码中调用设备运行时 API 函数,而不会发生死锁。

9.6.2.1.1. Device-Side Kernel Launch (CDP1)
9.6.2.1.1. 设备端内核启动(CDP1) 

See Device-Side Kernel Launch, above, for CDP2 version of document.
请参阅上面的设备端内核启动,获取文档的 CDP2 版本。

Kernels may be launched from the device using the standard CUDA <<< >>> syntax:
内核可以使用标准的 CUDA <<< >>> 语法从设备启动:

kernel_name<<< Dg, Db, Ns, S >>>([kernel arguments]);
  • Dg is of type dim3 and specifies the dimensions and size of the grid
    Dg 的类型为 dim3 ,指定了网格的维度和大小

  • Db is of type dim3 and specifies the dimensions and size of each thread block
    Db 的类型为 dim3 ,并指定了每个线程块的维度和大小

  • Ns is of type size_t and specifies the number of bytes of shared memory that is dynamically allocated per thread block for this call, in addition to the statically allocated memory. Ns is an optional argument that defaults to 0.
    Ns 的类型为 size_t ,指定了为此调用的每个线程块动态分配的共享内存字节数,这是在静态分配内存之外额外分配的。 Ns 是一个可选参数,默认值为 0。

  • S is of type cudaStream_t and specifies the stream associated with this call. The stream must have been allocated in the same thread block where the call is being made. S is an optional argument that defaults to 0.
    S 的类型为 cudaStream_t ,指定与此调用关联的流。流必须在进行调用的同一线程块中分配。 S 是一个可选参数,默认值为 0。
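Putting the four launch parameters together, a minimal device-side launch might look like the following sketch (the child kernel is hypothetical):

```cuda
// Hypothetical child kernel.
__global__ void child(int *data) { data[threadIdx.x] += 1; }

__global__ void parent(int *data) {
    if (threadIdx.x == 0) {
        cudaStream_t s;
        // Streams must be block-local and created with cudaStreamNonBlocking.
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        // Dg = 2 blocks, Db = 32 threads, Ns = 0 dynamic smem bytes, S = s
        child<<< dim3(2), dim3(32), 0, s >>>(data);
        cudaStreamDestroy(s);
    }
}
```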

9.6.2.1.1.1. Launches are Asynchronous (CDP1)
9.6.2.1.1.1. 启动是异步的(CDP1) 

See Launches are Asynchronous, above, for CDP2 version of document.
请参阅上面的异步启动,以获取文档的 CDP2 版本。

Identical to host-side launches, all device-side kernel launches are asynchronous with respect to the launching thread. That is to say, the <<<>>> launch command will return immediately and the launching thread will continue to execute until it hits an explicit launch-synchronization point such as cudaDeviceSynchronize().
与主机端启动相同,所有设备端内核启动都是相对于启动线程异步的。也就是说, <<<>>> 启动命令会立即返回,启动线程将继续执行,直到遇到显式的启动同步点,比如 cudaDeviceSynchronize()

Warning 警告

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.
在 CUDA 11.6 中,从父块(即在设备代码中使用 cudaDeviceSynchronize() )显式与子内核同步已被弃用,对于 compute_90+编译已被移除,并计划在未来的 CUDA 版本中完全移除。

The grid launch is posted to the device and will execute independently of the parent thread. The child grid may begin execution at any time after launch, but is not guaranteed to begin execution until the launching thread reaches an explicit launch-synchronization point.
网格启动已发布到设备,并将独立于父线程执行。子网格可能在启动后的任何时间开始执行,但不能保证在启动线程达到显式启动同步点之前开始执行。

9.6.2.1.1.2. Launch Environment Configuration (CDP1)
9.6.2.1.1.2. 启动环境配置(CDP1) 

See Launch Environment Configuration, above, for CDP2 version of document.
请参阅上文的启动环境配置,获取文档的 CDP2 版本。

All global device configuration settings (for example, shared memory and L1 cache size as returned from cudaDeviceGetCacheConfig(), and device limits returned from cudaDeviceGetLimit()) will be inherited from the parent. Likewise, device limits such as stack size will remain as-configured.
所有全局设备配置设置(例如,从 cudaDeviceGetCacheConfig() 返回的共享内存和 L1 缓存大小,以及从 cudaDeviceGetLimit() 返回的设备限制)将从父级继承。同样,诸如堆栈大小之类的设备限制将保持配置不变。

For host-launched kernels, per-kernel configurations set from the host will take precedence over the global setting. These configurations will be used when the kernel is launched from the device as well. It is not possible to reconfigure a kernel’s environment from the device.
对于主机启动的内核,主机设置的每个内核配置将优先于全局设置。当从设备启动内核时,将使用这些配置。无法从设备重新配置内核的环境。

9.6.2.1.2. Streams (CDP1)
9.6.2.1.2. 流 (CDP1) 

See Streams, above, for CDP2 version of document.
请参阅上面的 Streams,以获取文档的 CDP2 版本。

Both named and unnamed (NULL) streams are available from the device runtime. Named streams may be used by any thread within a thread-block, but stream handles may not be passed to other blocks or child/parent kernels. In other words, a stream should be treated as private to the block in which it is created. Stream handles are not guaranteed to be unique between blocks, so using a stream handle within a block that did not allocate it will result in undefined behavior.
设备运行时提供了命名流和未命名(NULL)流。 命名流可以被线程块内的任何线程使用,但流句柄不能传递给其他块或子/父内核。 换句话说,流应被视为在其创建的块中私有的。 流句柄在块之间不保证是唯一的,因此在未分配它的块内使用流句柄将导致未定义的行为。

Similar to host-side launch, work launched into separate streams may run concurrently, but actual concurrency is not guaranteed. Programs that depend upon concurrency between child kernels are not supported by the CUDA programming model and will have undefined behavior.
与主机端启动类似,启动到单独流中的工作可能会并发运行,但实际并发性不能保证。依赖于子内核之间并发性的程序不受 CUDA 编程模型支持,并且将具有未定义的行为。

The host-side NULL stream’s cross-stream barrier semantic is not supported on the device (see below for details). In order to retain semantic compatibility with the host runtime, all device streams must be created using the cudaStreamCreateWithFlags() API, passing the cudaStreamNonBlocking flag. The cudaStreamCreate() call is a host-runtime-only API and will fail to compile for the device.
主机端的 NULL 流的跨流屏障语义在设备上不受支持(请参见下文了解详情)。为了保持与主机运行时的语义兼容性,所有设备流必须使用 cudaStreamCreateWithFlags() API 创建,传递 cudaStreamNonBlocking 标志。 cudaStreamCreate() 调用是一个仅限主机运行时的 API,并且无法为设备编译。

As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed.
由于设备运行时不支持 cudaStreamSynchronize()cudaStreamQuery() ,因此当应用程序需要知道流启动的子内核何时完成时,应改用 cudaDeviceSynchronize()
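A sketch of the resulting pattern (child kernel hypothetical; note that device-side cudaDeviceSynchronize() is deprecated as of CUDA 11.6):

```cuda
__global__ void child(float *buf);        // assumed to be defined elsewhere

__global__ void parent(float *a, float *b) {
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    child<<< 1, 64, 0, s1 >>>(a);         // may overlap with work in s2,
    child<<< 1, 64, 0, s2 >>>(b);         // but concurrency is not guaranteed
    cudaDeviceSynchronize();              // no cudaStreamSynchronize() on device
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```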

Warning 警告

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.
在 CUDA 11.6 中,从父块(即在设备代码中使用 cudaDeviceSynchronize() )显式与子内核同步已被弃用,对于 compute_90+编译已被移除,并计划在未来的 CUDA 版本中完全移除。

9.6.2.1.2.1. The Implicit (NULL) Stream (CDP1)
9.6.2.1.2.1. 隐式(NULL)流(CDP1) 

See The Implicit (NULL) Stream, above, for CDP2 version of document.
请参阅上文中的隐式(NULL)流,以获取文档的 CDP2 版本。

Within a host program, the unnamed (NULL) stream has additional barrier synchronization semantics with other streams (see Default Stream for details). The device runtime offers a single implicit, unnamed stream shared between all threads in a block, but as all named streams must be created with the cudaStreamNonBlocking flag, work launched into the NULL stream will not insert an implicit dependency on pending work in any other streams (including NULL streams of other thread blocks).
在主机程序中,未命名(NULL)流与其他流具有额外的屏障同步语义(请参阅详细信息中的默认流)。设备运行时提供一个隐式的未命名流,在块中的所有线程之间共享,但是由于所有命名流都必须使用 cudaStreamNonBlocking 标志创建,因此在 NULL 流中启动的工作不会对其他流中的挂起工作(包括其他线程块的 NULL 流)产生隐式依赖关系。

9.6.2.1.3. Events (CDP1)
9.6.2.1.3. 事件(CDP1) 

See Events, above, for CDP2 version of document.
请参阅上面的事件,以获取文档的 CDP2 版本。

Only the inter-stream synchronization capabilities of CUDA events are supported. This means that cudaStreamWaitEvent() is supported, but cudaEventSynchronize(), cudaEventElapsedTime(), and cudaEventQuery() are not. As cudaEventElapsedTime() is not supported, cudaEvents must be created via cudaEventCreateWithFlags(), passing the cudaEventDisableTiming flag.
仅支持 CUDA 事件的跨流同步能力。这意味着支持 cudaStreamWaitEvent() ,但不支持 cudaEventSynchronize()cudaEventElapsedTime()cudaEventQuery() 。由于不支持 cudaEventElapsedTime() ,因此必须通过 cudaEventCreateWithFlags() 创建 cudaEvents,传递 cudaEventDisableTiming 标志。
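For example, inter-stream ordering within a block might be expressed as in this sketch (the producer/consumer kernels are hypothetical):

```cuda
__global__ void producer(int *d);         // assumed to be defined elsewhere
__global__ void consumer(int *d);

__global__ void parent(int *d) {
    cudaStream_t s1, s2;
    cudaEvent_t  e;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);  // timing unsupported

    producer<<< 1, 32, 0, s1 >>>(d);
    cudaEventRecord(e, s1);
    cudaStreamWaitEvent(s2, e, 0);        // s2 waits for the producer grid
    consumer<<< 1, 32, 0, s2 >>>(d);

    cudaEventDestroy(e);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```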

As for all device runtime objects, event objects may be shared between all threads within the thread-block which created them but are local to that block and may not be passed to other kernels, or between blocks within the same kernel. Event handles are not guaranteed to be unique between blocks, so using an event handle within a block that did not create it will result in undefined behavior.
对于所有设备运行时对象,事件对象可能在创建它们的线程块内的所有线程之间共享,但是它们对于该块是局部的,不能传递给其他内核,也不能在同一内核中的不同块之间传递。事件句柄在不同块之间不保证唯一,因此在未创建该句柄的块内使用事件句柄将导致未定义行为。

9.6.2.1.4. Synchronization (CDP1)
9.6.2.1.4. 同步(CDP1) 

See Synchronization, above, for CDP2 version of document.
请参阅上面的同步,以获取文档的 CDP2 版本。

Warning 警告

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.
在 CUDA 11.6 中,从父块(即在设备代码中使用 cudaDeviceSynchronize() )显式与子内核同步已被弃用,对于 compute_90+编译已被移除,并计划在未来的 CUDA 版本中完全移除。

The cudaDeviceSynchronize() function will synchronize on all work launched by any thread in the thread-block up to the point where cudaDeviceSynchronize() was called. Note that cudaDeviceSynchronize() may be called from within divergent code (see Block Wide Synchronization (CDP1)).
函数 cudaDeviceSynchronize() 将在线程块中的任何线程启动的所有工作上进行同步,直到调用 cudaDeviceSynchronize() 的点。请注意, cudaDeviceSynchronize() 可能会从分歧代码中调用(请参阅块范围同步(CDP1))。

It is up to the program to perform sufficient additional inter-thread synchronization, for example via a call to __syncthreads(), if the calling thread is intended to synchronize with child grids invoked from other threads.
程序需要执行足够的额外线程间同步,例如通过调用 __syncthreads() ,如果调用线程打算与从其他线程调用的子网格同步。

9.6.2.1.4.1. Block Wide Synchronization (CDP1)
9.6.2.1.4.1. 块宽同步(CDP1) 

See CUDA Dynamic Parallelism, above, for CDP2 version of document.
请参阅上面的 CUDA 动态并行性,获取文档的 CDP2 版本。

The cudaDeviceSynchronize() function does not imply intra-block synchronization. In particular, without explicit synchronization via a __syncthreads() directive the calling thread can make no assumptions about what work has been launched by any thread other than itself. For example if multiple threads within a block are each launching work and synchronization is desired for all this work at once (perhaps because of event-based dependencies), it is up to the program to guarantee that this work is submitted by all threads before calling cudaDeviceSynchronize().
cudaDeviceSynchronize() 函数不意味着块内同步。特别是,如果没有通过 __syncthreads() 指令进行显式同步,调用线程不能假设除自身之外的任何线程已启动的工作。例如,如果块内的多个线程都在启动工作,并且希望一次对所有这些工作进行同步(可能是因为基于事件的依赖关系),则程序必须确保在调用 cudaDeviceSynchronize() 之前所有线程都已提交了这些工作。

Because the implementation is permitted to synchronize on launches from any thread in the block, it is quite possible that simultaneous calls to cudaDeviceSynchronize() by multiple threads will drain all work in the first call and then have no effect for the later calls.
由于实现允许在块中的任何线程上同步启动,因此很可能多个线程同时调用 cudaDeviceSynchronize() 会在第一次调用中耗尽所有工作,然后对后续调用没有任何影响。
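The pattern described above might be sketched as follows (child kernel hypothetical; recall that device-side cudaDeviceSynchronize() is deprecated since CUDA 11.6):

```cuda
__global__ void child(int *d);            // assumed to be defined elsewhere

__global__ void parent(int *d) {
    child<<< 1, 1 >>>(d + threadIdx.x);   // every thread launches a child
    __syncthreads();                      // guarantee all launches are submitted
    if (threadIdx.x == 0)
        cudaDeviceSynchronize();          // one call now covers all children
    __syncthreads();                      // other threads wait for the sync
}
```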

9.6.2.1.5. Device Management (CDP1)
9.6.2.1.5. 设备管理(CDP1) 

See Device Management, above, for CDP2 version of document.
请参阅上面的设备管理,获取文档的 CDP2 版本。

Only the device on which a kernel is running will be controllable from that kernel. This means that device APIs such as cudaSetDevice() are not supported by the device runtime. The active device as seen from the GPU (returned from cudaGetDevice()) will have the same device number as seen from the host system. The cudaDeviceGetAttribute() call may request information about another device as this API allows specification of a device ID as a parameter of the call. Note that the catch-all cudaGetDeviceProperties() API is not offered by the device runtime - properties must be queried individually.
只有运行内核的设备才能从该内核进行控制。这意味着设备运行时不支持诸如 cudaSetDevice() 之类的设备 API。从 GPU 看到的活动设备(从 cudaGetDevice() 返回)将与从主机系统看到的设备编号相同。 cudaDeviceGetAttribute() 调用可能会请求有关另一个设备的信息,因为此 API 允许在调用的参数中指定设备 ID。请注意,设备运行时不提供通用的 cudaGetDeviceProperties() API - 必须逐个查询属性。
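For instance, individual attributes can be queried from device code as in this sketch (the attribute chosen is illustrative):

```cuda
// Sketch: query attributes one at a time, since cudaGetDeviceProperties()
// is not available in the device runtime.
__global__ void query(int *out) {
    int dev, devCount, smCount;
    cudaGetDevice(&dev);                  // same ID the host would see
    cudaGetDeviceCount(&devCount);        // other devices are visible...
    cudaDeviceGetAttribute(&smCount,
        cudaDevAttrMultiProcessorCount, dev);  // ...and queryable by ID
    out[0] = dev; out[1] = devCount; out[2] = smCount;
}
```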

9.6.2.1.6. Memory Declarations (CDP1)
9.6.2.1.6. 内存声明(CDP1) 

See Memory Declarations, above, for CDP2 version of document.
请参阅上面的内存声明,以获取文档的 CDP2 版本。

9.6.2.1.6.1. Device and Constant Memory (CDP1)
9.6.2.1.6.1. 设备和常量内存(CDP1) 

See Device and Constant Memory, above, for CDP2 version of document.
请参阅上面的设备和常量内存,以获取文档的 CDP2 版本。

Memory declared at file scope with __device__ or __constant__ memory space specifiers behaves identically when using the device runtime. All kernels may read or write device variables, whether the kernel was initially launched by the host or device runtime. Equivalently, all kernels will have the same view of __constant__s as declared at the module scope.
在文件范围内使用 __device____constant__ 内存空间修饰符声明的内存在使用设备运行时时表现相同。所有内核都可以读取或写入设备变量,无论内核最初是由主机还是设备运行时启动的。同样,所有内核将对在模块范围声明的 __constant__ s 具有相同的视图。

9.6.2.1.6.2. Textures and Surfaces (CDP1)
9.6.2.1.6.2. 纹理和表面(CDP1) 

See Textures and Surfaces, above, for CDP2 version of document.
请参阅上文中的“纹理和表面”部分,以查看文档的 CDP2 版本。

CUDA supports dynamically created texture and surface objects14, where a texture object may be created on the host, passed to a kernel, used by that kernel, and then destroyed from the host. The device runtime does not allow creation or destruction of texture or surface objects from within device code, but texture and surface objects created from the host may be used and passed around freely on the device. Regardless of where they are created, dynamically created texture objects are always valid and may be passed to child kernels from a parent.
CUDA 支持动态创建的纹理和表面对象,其中纹理对象可以在主机上创建,传递给内核,由该内核使用,然后可以从主机销毁。设备运行时不允许在设备代码中创建或销毁纹理或表面对象,但可以在主机上创建的纹理和表面对象可以在设备上自由使用和传递。无论在何处创建,动态创建的纹理对象始终有效,并可以从父级传递给子内核。

Note 注意

The device runtime does not support legacy module-scope (i.e., Fermi-style) textures and surfaces within a kernel launched from the device. Module-scope (legacy) textures may be created from the host and used in device code as for any kernel, but may only be used by a top-level kernel (i.e., the one which is launched from the host).
设备运行时不支持从设备启动的内核中的传统模块范围(即 Fermi 风格)纹理和表面。传统模块范围的纹理可以从主机创建,并在设备代码中用作任何内核,但只能被顶级内核使用(即从主机启动的内核)。

9.6.2.1.6.3. Shared Memory Variable Declarations (CDP1)
9.6.2.1.6.3. 共享内存变量声明(CDP1) 

See Shared Memory Variable Declarations, above, for CDP2 version of document.
请参阅上面的共享内存变量声明,以获取文档的 CDP2 版本。

In CUDA C++ shared memory can be declared either as a statically sized file-scope or function-scoped variable, or as an extern variable with the size determined at runtime by the kernel’s caller via a launch configuration argument. Both types of declarations are valid under the device runtime.
在 CUDA C++中,共享内存可以声明为静态大小的文件作用域或函数作用域变量,也可以声明为一个 extern 变量,其大小由内核的调用者通过启动配置参数在运行时确定。这两种声明类型在设备运行时都是有效的。

__global__ void permute(int n, int *data) {
   extern __shared__ int smem[];
   if (n <= 1)
       return;

   smem[threadIdx.x] = data[threadIdx.x];
   __syncthreads();

   permute_data(smem, n);
   __syncthreads();

   // Write back to GMEM since we can't pass SMEM to children.
   data[threadIdx.x] = smem[threadIdx.x];
   __syncthreads();

   if (threadIdx.x == 0) {
       permute<<< 1, 256, n/2*sizeof(int) >>>(n/2, data);
       permute<<< 1, 256, n/2*sizeof(int) >>>(n/2, data+n/2);
   }
}

void host_launch(int *data) {
    permute<<< 1, 256, 256*sizeof(int) >>>(256, data);
}
9.6.2.1.6.4. Symbol Addresses (CDP1)
9.6.2.1.6.4. 符号地址(CDP1) 

See Symbol Addresses, above, for CDP2 version of document.
请参阅上面的符号地址,以查看文档的 CDP2 版本。

Device-side symbols (i.e., those marked __device__) may be referenced from within a kernel simply via the & operator, as all global-scope device variables are in the kernel’s visible address space. This also applies to __constant__ symbols, although in this case the pointer will reference read-only data.
设备端符号(即那些标记为 __device__ 的符号)可以通过 & 运算符在内核中直接引用,因为所有全局范围的设备变量都在内核的可见地址空间中。这也适用于 __constant__ 符号,尽管在这种情况下指针将引用只读数据。

Given that device-side symbols can be referenced directly, those CUDA runtime APIs which reference symbols (e.g., cudaMemcpyToSymbol() or cudaGetSymbolAddress()) are redundant and hence not supported by the device runtime. Note this implies that constant data cannot be altered from within a running kernel, even ahead of a child kernel launch, as references to __constant__ space are read-only.
鉴于设备端符号可以直接引用,因此那些引用符号的 CUDA 运行时 API(例如 cudaMemcpyToSymbol()cudaGetSymbolAddress() )是多余的,因此不受设备运行时支持。请注意,这意味着常量数据无法在运行中的内核内部被更改,即使在子内核启动之前,对 __constant__ 空间的引用也是只读的。
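A minimal sketch (symbol and kernel names hypothetical):

```cuda
__device__   int counter;                 // writable from any kernel
__constant__ int table[4];                // read-only; values set from the host

__global__ void child(int *p, const int *t);  // assumed defined elsewhere

__global__ void parent() {
    counter = 0;                          // __device__ symbols are writable
    // Symbol addresses may be passed directly to a child launch; the
    // __constant__ pointer references read-only data.
    child<<< 1, 1 >>>(&counter, table);
}
```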

9.6.2.1.7. API Errors and Launch Failures (CDP1)
9.6.2.1.7. API 错误和启动失败(CDP1) 

See API Errors and Launch Failures, above, for CDP2 version of document.
请参阅上面的 API 错误和启动失败,以获取文档的 CDP2 版本。

As usual for the CUDA runtime, any function may return an error code. The last error code returned is recorded and may be retrieved via the cudaGetLastError() call. Errors are recorded per-thread, so that each thread can identify the most recent error that it has generated. The error code is of type cudaError_t.
与 CUDA 运行时一样,任何函数都可能返回错误代码。返回的最后一个错误代码被记录下来,可以通过 cudaGetLastError() 调用检索。错误是按线程记录的,因此每个线程都可以识别其生成的最近错误。错误代码的类型为 cudaError_t

Similar to a host-side launch, device-side launches may fail for many reasons (invalid arguments, etc). The user must call cudaGetLastError() to determine if a launch generated an error, however lack of an error after launch does not imply the child kernel completed successfully.
与主机端启动类似,设备端启动可能因多种原因(无效参数等)而失败。用户必须调用 cudaGetLastError() 来确定启动是否生成错误,但启动后没有错误并不意味着子内核成功完成。
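A typical per-thread check might look like this sketch (child kernel hypothetical):

```cuda
#include <cstdio>

__global__ void child(int *d);            // assumed to be defined elsewhere

__global__ void parent(int *d) {
    child<<< 1, 1 >>>(d);
    cudaError_t err = cudaGetLastError();  // per-thread last error
    if (err != cudaSuccess) {
        // A failure here means the launch itself was rejected; cudaSuccess
        // does not imply the child grid completed successfully.
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }
}
```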

For device-side exceptions, e.g., access to an invalid address, an error in a child grid will be returned to the host instead of being returned by the parent’s call to cudaDeviceSynchronize().
对于设备端异常,例如访问无效地址,子网格中的错误将返回给主机,而不是由父调用 cudaDeviceSynchronize() 返回。

9.6.2.1.7.1. Launch Setup APIs (CDP1)
9.6.2.1.7.1. 启动设置 API(CDP1) 

See Launch Setup APIs, above, for CDP2 version of document.
请参阅上面的启动设置 API,获取文档的 CDP2 版本。

Kernel launch is a system-level mechanism exposed through the device runtime library, and as such is available directly from PTX via the underlying cudaGetParameterBuffer() and cudaLaunchDevice() APIs. It is permitted for a CUDA application to call these APIs itself, with the same requirements as for PTX. In both cases, the user is then responsible for correctly populating all necessary data structures in the correct format according to specification. Backwards compatibility is guaranteed in these data structures.
内核启动是通过设备运行时库公开的系统级机制,因此可以直接通过底层 cudaGetParameterBuffer()cudaLaunchDevice() API 从 PTX 中使用。CUDA 应用程序可以自行调用这些 API,要求与 PTX 相同。在这两种情况下,用户需要负责根据规范正确填充所有必要的数据结构以正确的格式。这些数据结构保证向后兼容。

As with host-side launch, the device-side operator <<<>>> maps to underlying kernel launch APIs. This is so that users targeting PTX will be able to enact a launch, and so that the compiler front-end can translate <<<>>> into these calls.
与主机端启动一样,设备端操作符 <<<>>> 映射到底层内核启动 API。这样,针对 PTX 的用户就能执行启动操作,编译器前端也能将 <<<>>> 翻译成这些调用。

Table 11 New Device-only Launch Implementation Functions
表 11 新设备专用启动实现函数 

Runtime API Launch Functions
运行时 API 启动函数

Description of Difference From Host Runtime Behaviour (behavior is identical if no description)
与主机运行时行为的差异描述(如果没有描述,则行为相同)

cudaGetParameterBuffer

Generated automatically from <<<>>>. Note different API to host equivalent.
<<<>>> 自动生成。请注意,托管等效的 API 不同。

cudaLaunchDevice

Generated automatically from <<<>>>. Note different API to host equivalent.
<<<>>> 自动生成。请注意,托管等效的 API 不同。

The APIs for these launch functions are different to those of the CUDA Runtime API, and are defined as follows:
这些启动函数的 API 与 CUDA Runtime API 的 API 不同,并定义如下:

extern __device__ cudaError_t cudaGetParameterBuffer(void **params);
extern __device__ cudaError_t cudaLaunchDevice(void *kernel,
                                        void *params, dim3 gridDim,
                                        dim3 blockDim,
                                        unsigned int sharedMemSize = 0,
                                        cudaStream_t stream = 0);
9.6.2.1.8. API Reference (CDP1)
9.6.2.1.8. API 参考 (CDP1) 

See API Reference, above, for CDP2 version of document.
请参阅上面的 API 参考,获取文档的 CDP2 版本。

The portions of the CUDA Runtime API supported in the device runtime are detailed here. Host and device runtime APIs have identical syntax; semantics are the same except where indicated. The table below provides an overview of the API relative to the version available from the host.
CUDA Runtime API 支持的部分在设备运行时中详细说明。主机和设备运行时 API 具有相同的语法;语义相同,除非另有说明。下表提供了相对于主机版本可用 API 的概述。

Table 12 Supported API Functions
表 12 支持的 API 函数 

Runtime API Functions 运行时 API 函数

Details 详细信息

cudaDeviceSynchronize

Synchronizes on work launched from thread’s own block only.
仅在线程自己的块中启动的工作上同步。

Warning: Note that calling this API from device code is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.
警告:请注意,在 CUDA 11.6 中从设备代码调用此 API 已被弃用,在 compute_90+编译中已移除,并计划在未来的 CUDA 版本中完全移除。

cudaDeviceGetCacheConfig

cudaDeviceGetLimit

cudaGetLastError

Last error is per-thread state, not per-block state
最后一个错误是每个线程的状态,而不是每个块的状态

cudaPeekAtLastError

cudaGetErrorString

cudaGetDeviceCount

cudaDeviceGetAttribute

Will return attributes for any device
将返回任何设备的属性

cudaGetDevice

Always returns current device ID as would be seen from host
始终返回当前设备 ID,就像从主机看到的那样

cudaStreamCreateWithFlags

Must pass cudaStreamNonBlocking flag 必须传递 cudaStreamNonBlocking 标志

cudaStreamDestroy

cudaStreamWaitEvent

cudaEventCreateWithFlags

Must pass cudaEventDisableTiming flag 必须传递 cudaEventDisableTiming 标志

cudaEventRecord

cudaEventDestroy

cudaFuncGetAttributes

cudaMemcpyAsync

Notes about all memcpy/memset functions:
关于所有 memcpy/memset 函数的注释:

  • Only async memcpy/set functions are supported
    仅支持异步 memcpy/set 函数

  • Only device-to-device memcpy is permitted
    仅允许设备到设备的 memcpy

  • May not pass in local or shared memory pointers
    不得传入本地或共享内存指针

cudaMemcpy2DAsync

cudaMemcpy3DAsync

cudaMemsetAsync

cudaMemset2DAsync

cudaMemset3DAsync

cudaRuntimeGetVersion

cudaMalloc

May not call cudaFree on the device on a pointer created on the host, and vice-versa
在设备上不允许在主机上创建的指针上调用 cudaFree ,反之亦然

cudaFree

cudaOccupancyMaxActiveBlocksPerMultiprocessor

cudaOccupancyMaxPotentialBlockSize

cudaOccupancyMaxPotentialBlockSizeVariableSMem

9.6.2.2. Device-side Launch from PTX (CDP1)
9.6.2.2. 从 PTX(CDP1)启动设备端 

See Device-side Launch from PTX, above, for CDP2 version of document.
请参阅上面的 PTX 设备端启动,获取文档的 CDP2 版本。

This section is for the programming language and compiler implementers who target Parallel Thread Execution (PTX) and plan to support Dynamic Parallelism in their language. It provides the low-level details related to supporting kernel launches at the PTX level.
本节面向针对并行线程执行(PTX)并计划支持动态并行性的编程语言和编译器实现者。它提供了支持在 PTX 级别进行内核启动的与低级细节相关的信息。

9.6.2.2.1. Kernel Launch APIs (CDP1)
9.6.2.2.1. 内核启动 API(CDP1) 

See Kernel Launch APIs, above, for CDP2 version of document.
请参阅上面的内核启动 API,获取文档的 CDP2 版本。

Device-side kernel launches can be implemented using the following two APIs accessible from PTX: cudaLaunchDevice() and cudaGetParameterBuffer(). cudaLaunchDevice() launches the specified kernel with the parameter buffer that is obtained by calling cudaGetParameterBuffer() and filled with the parameters to the launched kernel. The parameter buffer can be NULL, i.e., no need to invoke cudaGetParameterBuffer(), if the launched kernel does not take any parameters.
设备端内核启动可以使用从 PTX 访问的以下两个 API 来实现: cudaLaunchDevice()cudaGetParameterBuffer()cudaLaunchDevice() 使用参数缓冲区来启动指定的内核;该缓冲区通过调用 cudaGetParameterBuffer() 获得,并填入要传递给所启动内核的参数。如果要启动的内核不需要任何参数,则参数缓冲区可以为 NULL,即无需调用 cudaGetParameterBuffer() 。

9.6.2.2.1.1. cudaLaunchDevice (CDP1)

See cudaLaunchDevice, above, for CDP2 version of document.
请参阅上面的 cudaLaunchDevice,获取文档的 CDP2 版本。

At the PTX level, cudaLaunchDevice() needs to be declared in one of the two forms shown below before it is used.
在 PTX 级别, cudaLaunchDevice() 需要在使用之前以下面显示的两种形式之一声明。

// PTX-level Declaration of cudaLaunchDevice() when .address_size is 64
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b64 func,
  .param .b64 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b64 stream
)
;
// PTX-level Declaration of cudaLaunchDevice() when .address_size is 32
.extern .func(.param .b32 func_retval0) cudaLaunchDevice
(
  .param .b32 func,
  .param .b32 parameterBuffer,
  .param .align 4 .b8 gridDimension[12],
  .param .align 4 .b8 blockDimension[12],
  .param .b32 sharedMemSize,
  .param .b32 stream
)
;

The CUDA-level declaration below is mapped to one of the aforementioned PTX-level declarations and is found in the system header file cuda_device_runtime_api.h. The function is defined in the cudadevrt system library, which must be linked with a program in order to use device-side kernel launch functionality.
下面的 CUDA 级别声明被映射到前面提到的 PTX 级别声明之一,并且可以在系统头文件 cuda_device_runtime_api.h 中找到。该函数在 cudadevrt 系统库中定义,必须与程序链接才能使用设备端内核启动功能。

// CUDA-level declaration of cudaLaunchDevice()
extern "C" __device__
cudaError_t cudaLaunchDevice(void *func, void *parameterBuffer,
                             dim3 gridDimension, dim3 blockDimension,
                             unsigned int sharedMemSize,
                             cudaStream_t stream);

The first parameter is a pointer to the kernel to be launched, and the second parameter is the parameter buffer that holds the actual parameters to the launched kernel. The layout of the parameter buffer is explained in Parameter Buffer Layout (CDP1), below. Other parameters specify the launch configuration, i.e., the grid dimension, block dimension, shared memory size, and the stream associated with the launch (please refer to Execution Configuration for a detailed description of the launch configuration).
第一个参数是指向要启动的内核的指针,第二个参数是保存要启动的内核的实际参数的参数缓冲区。参数缓冲区的布局在下面的参数缓冲区布局(CDP1)中有解释。其他参数指定启动配置,即网格维度、块维度、共享内存大小以及与启动相关联的流(请参阅执行配置以获取启动配置的详细描述)。

9.6.2.2.1.2. cudaGetParameterBuffer (CDP1)

See cudaGetParameterBuffer, above, for CDP2 version of document.
请参阅上面的 cudaGetParameterBuffer,获取文档的 CDP2 版本。

cudaGetParameterBuffer() needs to be declared at the PTX level before it’s used. The PTX-level declaration must be in one of the two forms given below, depending on address size:
cudaGetParameterBuffer() 需要在使用之前在 PTX 级别声明。 PTX 级别声明必须采用以下两种形式之一,具体取决于地址大小:

// PTX-level Declaration of cudaGetParameterBuffer() when .address_size is 64
// When .address_size is 64
.extern .func(.param .b64 func_retval0) cudaGetParameterBuffer
(
  .param .b64 alignment,
  .param .b64 size
)
;
// PTX-level Declaration of cudaGetParameterBuffer() when .address_size is 32
.extern .func(.param .b32 func_retval0) cudaGetParameterBuffer
(
  .param .b32 alignment,
  .param .b32 size
)
;

The following CUDA-level declaration of cudaGetParameterBuffer() is mapped to the aforementioned PTX-level declaration:
以下 CUDA 级别的声明 cudaGetParameterBuffer() 被映射到上述的 PTX 级别声明:

// CUDA-level Declaration of cudaGetParameterBuffer()
extern "C" __device__
void *cudaGetParameterBuffer(size_t alignment, size_t size);

The first parameter specifies the alignment requirement of the parameter buffer and the second parameter the size requirement in bytes. In the current implementation, the parameter buffer returned by cudaGetParameterBuffer() is always guaranteed to be 64-byte aligned, and the alignment requirement parameter is ignored. However, it is recommended to pass the correct alignment requirement value - which is the largest alignment of any parameter to be placed in the parameter buffer - to cudaGetParameterBuffer() to ensure portability in the future.
第一个参数指定参数缓冲区的对齐要求,第二个参数指定字节大小要求。在当前实现中, cudaGetParameterBuffer() 返回的参数缓冲区始终保证是 64 字节对齐的,对齐要求参数会被忽略。然而,建议传递正确的对齐要求值 - 即要放置在参数缓冲区中的任何参数的最大对齐值 - 到 cudaGetParameterBuffer() ,以确保未来的可移植性。

9.6.2.2.2. Parameter Buffer Layout (CDP1)
9.6.2.2.2. 参数缓冲区布局(CDP1) 

See Parameter Buffer Layout, above, for CDP2 version of document.
请参阅上文的参数缓冲区布局,以获取文档的 CDP2 版本。

Parameter reordering in the parameter buffer is prohibited, and each individual parameter placed in the parameter buffer is required to be aligned. That is, each parameter must be placed at the nth byte in the parameter buffer, where n is the smallest multiple of the parameter size that is greater than the offset of the last byte taken by the preceding parameter. The maximum size of the parameter buffer is 4KB.
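As a concrete illustration of this rule, the offset computation can be sketched in plain C++ (`parameterOffsets` is a hypothetical helper for illustration, not part of any CUDA API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper illustrating the CDP1 layout rule: a parameter of
// size s is placed at the smallest multiple of s greater than the offset
// of the last byte used by the preceding parameter (the first parameter
// goes at offset 0).
std::vector<size_t> parameterOffsets(const std::vector<size_t>& sizes) {
    std::vector<size_t> offsets;
    size_t end = 0;  // one past the last byte used so far
    for (size_t s : sizes) {
        size_t off = (end + s - 1) / s * s;  // round `end` up to a multiple of s
        offsets.push_back(off);
        end = off + s;
    }
    return offsets;
}
```

For parameters of sizes 4, 8, and 2 bytes (for example an int, a double, and a short), this yields offsets 0, 8, and 16.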

For a more detailed description of PTX code generated by the CUDA compiler, please refer to the PTX-3.5 specification.

9.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)

See Toolkit Support for Dynamic Parallelism, above, for CDP2 version of document.

9.6.2.3.1. Including Device Runtime API in CUDA Code (CDP1)

See Including Device Runtime API in CUDA Code, above, for CDP2 version of document.

Similar to the host-side runtime API, prototypes for the CUDA device runtime API are included automatically during program compilation. There is no need to include cuda_device_runtime_api.h explicitly.

9.6.2.3.2. Compiling and Linking (CDP1)

See Compiling and Linking, above, for CDP2 version of document.

When compiling and linking CUDA programs using dynamic parallelism with nvcc, the program will automatically link against the static device runtime library libcudadevrt.

The device runtime is offered as a static library (cudadevrt.lib on Windows, libcudadevrt.a under Linux), against which a GPU application that uses the device runtime must be linked. Linking of device libraries can be accomplished through nvcc and/or nvlink. Two simple examples are shown below.

A device runtime program may be compiled and linked in a single step, if all required source files can be specified from the command line:

$ nvcc -arch=sm_75 -rdc=true hello_world.cu -o hello -lcudadevrt

It is also possible to compile CUDA .cu source files first to object files, and then link these together in a two-stage process:

$ nvcc -arch=sm_75 -dc hello_world.cu -o hello_world.o
$ nvcc -arch=sm_75 -rdc=true hello_world.o -o hello -lcudadevrt

Please see the Using Separate Compilation section of The CUDA Compiler Driver NVCC guide for more details.

9.6.3. Programming Guidelines (CDP1)

See Programming Guidelines, above, for CDP2 version of document.

9.6.3.1. Basics (CDP1)

See Basics, above, for CDP2 version of document.

The device runtime is a functional subset of the host runtime. API level device management, kernel launching, device memcpy, stream management, and event management are exposed from the device runtime.

Programming for the device runtime should be familiar to someone who already has experience with CUDA. Device runtime syntax and semantics are largely the same as that of the host API, with any exceptions detailed earlier in this document.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

The following example shows a simple Hello World program incorporating dynamic parallelism:

#include <stdio.h>

__global__ void childKernel()
{
    printf("Hello ");
}

__global__ void parentKernel()
{
    // launch child
    childKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return;
    }

    // wait for child to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return;
    }

    printf("World!\n");
}

int main(int argc, char *argv[])
{
    // launch parent
    parentKernel<<<1,1>>>();
    if (cudaSuccess != cudaGetLastError()) {
        return 1;
    }

    // wait for parent to complete
    if (cudaSuccess != cudaDeviceSynchronize()) {
        return 2;
    }

    return 0;
}

This program may be built in a single step from the command line as follows:

$ nvcc -arch=sm_75 -rdc=true hello_world.cu -o hello -lcudadevrt

9.6.3.2. Performance (CDP1)

See Performance, above, for CDP2 version of document.

9.6.3.2.1. Synchronization (CDP1)

See CUDA Dynamic Parallelism, above, for CDP2 version of document.

Warning

Explicit synchronization with child kernels from a parent block (such as using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

Synchronization by one thread may impact the performance of other threads in the same Thread Block, even when those other threads do not call cudaDeviceSynchronize() themselves. This impact will depend upon the underlying implementation. In general the implicit synchronization of child kernels done when a thread block ends is more efficient compared to calling cudaDeviceSynchronize() explicitly. It is therefore recommended to only call cudaDeviceSynchronize() if it is needed to synchronize with a child kernel before a thread block ends.

9.6.3.2.2. Dynamic-parallelism-enabled Kernel Overhead (CDP1)

See Dynamic-parallelism-enabled Kernel Overhead, above, for CDP2 version of document.

System software which is active when controlling dynamic launches may impose an overhead on any kernel which is running at the time, whether or not it invokes kernel launches of its own. This overhead arises from the device runtime's execution tracking and management software and may result in decreased performance, for example, for library calls made from the device compared to calls made from the host side. This overhead is, in general, incurred for applications that link against the device runtime library.

9.6.3.3. Implementation Restrictions and Limitations (CDP1)

See Implementation Restrictions and Limitations, above, for CDP2 version of document.

Dynamic Parallelism guarantees all semantics described in this document; however, certain hardware and software resources are implementation-dependent and limit the scale, performance, and other properties of a program which uses the device runtime.

9.6.3.3.1. Runtime (CDP1)

See Runtime, above, for CDP2 version of document.

9.6.3.3.1.1. Memory Footprint (CDP1)

See Memory Footprint, above, for CDP2 version of document.

The device runtime system software reserves memory for various management purposes, in particular one reservation which is used for saving parent-grid state during synchronization, and a second reservation for tracking pending grid launches. Configuration controls are available to reduce the size of these reservations in exchange for certain launch limitations. See Configuration Options (CDP1), below, for details.

The majority of reserved memory is allocated as backing-store for parent kernel state, for use when synchronizing on a child launch. Conservatively, this memory must support storing of state for the maximum number of live threads possible on the device. This means that each parent generation at which cudaDeviceSynchronize() is callable may require up to 860MB of device memory, depending on the device configuration, which will be unavailable for program use even if it is not all consumed.

9.6.3.3.1.2. Nesting and Synchronization Depth (CDP1)

See CUDA Dynamic Parallelism, above, for CDP2 version of document.

Using the device runtime, one kernel may launch another kernel, and that kernel may launch another, and so on. Each subordinate launch is considered a new nesting level, and the total number of levels is the nesting depth of the program. The synchronization depth is defined as the deepest level at which the program will explicitly synchronize on a child launch. Typically this is one less than the nesting depth of the program, but if the program does not need to call cudaDeviceSynchronize() at all levels then the synchronization depth might be substantially different from the nesting depth.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

The overall maximum nesting depth is limited to 24, but practically speaking the real limit will be the amount of memory required by the system for each new level (see Memory Footprint (CDP1) above). Any launch which would result in a kernel at a deeper level than the maximum will fail. Note that this may also apply to cudaMemcpyAsync(), which might itself generate a kernel launch. See Configuration Options (CDP1) for details.

By default, sufficient storage is reserved for two levels of synchronization. This maximum synchronization depth (and hence reserved storage) may be controlled by calling cudaDeviceSetLimit() and specifying cudaLimitDevRuntimeSyncDepth. The number of levels to be supported must be configured before the top-level kernel is launched from the host, in order to guarantee successful execution of a nested program. Calling cudaDeviceSynchronize() at a depth greater than the specified maximum synchronization depth will return an error.
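As a sketch of this host-side configuration (illustrative values, assuming the CUDA runtime headers and a pre-compute_90 CDP1 toolkit; `configureDeviceRuntime` is a hypothetical helper), the limits might be set before launching the top-level kernel:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative CDP1 configuration sketch: the values 3 and 4096 are
// examples, not recommendations. Limits must be set before any kernel
// is launched from the host.
int configureDeviceRuntime() {
    // Permit explicit cudaDeviceSynchronize() from device code up to 3 levels deep.
    if (cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 3) != cudaSuccess)
        return -1;
    // Reserve fixed-pool tracking space for up to 4096 outstanding device-side launches.
    if (cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096) != cudaSuccess)
        return -1;
    size_t depth = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);
    printf("sync depth limit: %zu\n", depth);
    return 0;
}
```

Note that this is a configuration fragment only; it requires a CUDA-capable device to run.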

An optimization is permitted where the system detects that it need not reserve space for the parent’s state in cases where the parent kernel never calls cudaDeviceSynchronize(). In this case, because explicit parent/child synchronization never occurs, the memory footprint required for a program will be much less than the conservative maximum. Such a program could specify a shallower maximum synchronization depth to avoid over-allocation of backing store.

9.6.3.3.1.3. Pending Kernel Launches (CDP1)

See Pending Kernel Launches, above, for CDP2 version of document.

When a kernel is launched, all associated configuration and parameter data is tracked until the kernel completes. This data is stored within a system-managed launch pool.

The launch pool is divided into a fixed-size pool and a virtualized pool with lower performance. The device runtime system software will try to track launch data in the fixed-size pool first. The virtualized pool will be used to track new launches when the fixed-size pool is full.

The size of the fixed-size launch pool is configurable by calling cudaDeviceSetLimit() from the host and specifying cudaLimitDevRuntimePendingLaunchCount.

9.6.3.3.1.4. Configuration Options (CDP1)

See Configuration Options, above, for CDP2 version of document.

Resource allocation for the device runtime system software is controlled via the cudaDeviceSetLimit() API from the host program. Limits must be set before any kernel is launched, and may not be changed while the GPU is actively running programs.

Warning

Explicit synchronization with child kernels from a parent block (i.e. using cudaDeviceSynchronize() in device code) is deprecated in CUDA 11.6, removed for compute_90+ compilation, and is slated for full removal in a future CUDA release.

The following named limits may be set:

cudaLimitDevRuntimeSyncDepth

Sets the maximum depth at which cudaDeviceSynchronize() may be called. Launches may be performed deeper than this, but explicit synchronization deeper than this limit will return cudaErrorLaunchMaxDepthExceeded. The default maximum sync depth is 2.

cudaLimitDevRuntimePendingLaunchCount

Controls the amount of memory set aside for buffering kernel launches which have not yet begun to execute, due either to unresolved dependencies or lack of execution resources. When the buffer is full, the device runtime system software will attempt to track new pending launches in a lower performance virtualized buffer. If the virtualized buffer is also full, i.e. when all available heap space is consumed, launches will not occur, and the thread's last error will be set to cudaErrorLaunchPendingCountExceeded. The default pending launch count is 2048 launches.

cudaLimitStackSize

Controls the stack size in bytes of each GPU thread. The CUDA driver automatically increases the per-thread stack size for each kernel launch as needed. This size isn't reset back to the original value after each launch. To set the per-thread stack size to a different value, cudaDeviceSetLimit() can be called to set this limit. The stack will be immediately resized, and if necessary, the device will block until all preceding requested tasks are complete. cudaDeviceGetLimit() can be called to get the current per-thread stack size.

9.6.3.3.1.5. Memory Allocation and Lifetime (CDP1)

See Memory Allocation and Lifetime, above, for CDP2 version of document.

cudaMalloc() and cudaFree() have distinct semantics between the host and device environments. When invoked from the host, cudaMalloc() allocates a new region from unused device memory. When invoked from the device runtime these functions map to device-side malloc() and free(). This implies that within the device environment the total allocatable memory is limited to the device malloc() heap size, which may be smaller than the available unused device memory. Also, it is an error to invoke cudaFree() from the host program on a pointer which was allocated by cudaMalloc() on the device or vice-versa.
cudaMalloc()cudaFree() 在主机和设备环境之间具有不同的语义。当从主机调用时, cudaMalloc() 会从未使用的设备内存中分配一个新区域。当从设备运行时调用这些函数时,这些函数会映射到设备端的 malloc()free() 。这意味着在设备环境中,可分配的总内存受限于设备 malloc() 堆大小,这可能小于可用的未使用设备内存。此外,在主机程序上调用 cudaFree() 对由设备上的 cudaMalloc() 分配的指针或反之则是错误的。

                       cudaMalloc() on Host     cudaMalloc() on Device
cudaFree() on Host     Supported                Not Supported
cudaFree() on Device   Not Supported            Supported
Allocation limit       Free device memory       cudaLimitMallocHeapSize

9.6.3.3.1.6. SM Id and Warp Id (CDP1)

See SM Id and Warp Id, above, for CDP2 version of document.

Note that in PTX %smid and %warpid are defined as volatile values. The device runtime may reschedule thread blocks onto different SMs in order to more efficiently manage resources. As such, it is unsafe to rely upon %smid or %warpid remaining unchanged across the lifetime of a thread or thread block.

9.6.3.3.1.7. ECC Errors (CDP1)

See ECC Errors, above, for CDP2 version of document.

No notification of ECC errors is available to code within a CUDA kernel. ECC errors are reported at the host side once the entire launch tree has completed. Any ECC errors which arise during execution of a nested program will either generate an exception or continue execution (depending upon error and configuration).

14

Dynamically created texture and surface objects are an addition to the CUDA memory model introduced with CUDA 5.0. Please see the CUDA Programming Guide for details.

10. Virtual Memory Management

10.1. Introduction

The Virtual Memory Management APIs provide a way for the application to directly manage the unified virtual address space that CUDA provides to map physical memory to virtual addresses accessible by the GPU. Introduced in CUDA 10.2, these APIs additionally provide a new way to interop with other processes and graphics APIs like OpenGL and Vulkan, as well as provide newer memory attributes that a user can tune to fit their applications.

Historically, memory allocation calls (such as cudaMalloc()) in the CUDA programming model have returned a memory address that points to the GPU memory. The address thus obtained could be used with any CUDA API or inside a device kernel. However, the memory allocated could not be resized depending on the user’s memory needs. In order to increase an allocation’s size, the user had to explicitly allocate a larger buffer, copy data from the initial allocation, free it and then continue to keep track of the newer allocation’s address. This often leads to lower performance and higher peak memory utilization for applications. Essentially, users had a malloc-like interface for allocating GPU memory, but did not have a corresponding realloc to complement it. The Virtual Memory Management APIs decouple the idea of an address and memory and allow the application to handle them separately. The APIs allow applications to map and unmap memory from a virtual address range as they see fit.

In the case of enabling peer device access to memory allocations by using cudaEnablePeerAccess, all past and future user allocations are mapped to the target peer device. This leads to users unwittingly paying the runtime cost of mapping all cudaMalloc allocations to peer devices. However, in most situations applications communicate by sharing only a few allocations with another device, and not all allocations are required to be mapped to all the devices. With Virtual Memory Management, applications can specifically choose certain allocations to be accessible from target devices.

The CUDA Virtual Memory Management APIs expose fine-grained control to the user for managing GPU memory in applications. They provide APIs that let users:

  • Place memory allocated on different devices into a contiguous VA range.

  • Perform interprocess communication for memory sharing using platform-specific mechanisms.

  • Opt into newer memory types on the devices that support them.

In order to allocate memory, the Virtual Memory Management programming model exposes the following functionality:

  • Allocating physical memory.

  • Reserving a VA range.

  • Mapping allocated memory to the VA range.

  • Controlling access rights on the mapped range.

Note that the suite of APIs described in this section requires a system that supports UVA.

10.2. Query for Support

Before attempting to use Virtual Memory Management APIs, applications must ensure that the devices they want to use support CUDA Virtual Memory Management. The following code sample shows querying for Virtual Memory Management support:

int deviceSupportsVmm;
CUresult result = cuDeviceGetAttribute(&deviceSupportsVmm, CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED, device);
if (deviceSupportsVmm != 0) {
    // `device` supports Virtual Memory Management
}

10.3. Allocating Physical Memory

The first step in memory allocation using Virtual Memory Management APIs is to create a physical memory chunk that will provide a backing for the allocation. In order to allocate physical memory, applications must use the cuMemCreate API. The allocation created by this function does not have any device or host mappings. The function argument CUmemAllocationProp describes the properties of the memory to allocate, such as the location of the allocation, whether the allocation is going to be shared with another process (or other graphics APIs), or the physical attributes of the memory to be allocated. Users must ensure the requested allocation's size is aligned to the appropriate granularity. Information regarding an allocation's granularity requirements can be queried using cuMemGetAllocationGranularity. The following code snippet shows allocating physical memory with cuMemCreate:

CUmemGenericAllocationHandle allocatePhysicalMemory(int device, size_t size) {
    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = device;

    size_t granularity = 0;
    cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);

    // Ensure size matches granularity requirements for the allocation,
    // e.g. ROUND_UP(x, n) = ((x + n - 1) / n) * n
    size_t padded_size = ROUND_UP(size, granularity);

    // Allocate physical memory
    CUmemGenericAllocationHandle allocHandle;
    cuMemCreate(&allocHandle, padded_size, &prop, 0);

    return allocHandle;
}

The memory allocated by cuMemCreate is referenced by the CUmemGenericAllocationHandle it returns. This is a departure from the cudaMalloc style of allocation, which returns a pointer to GPU memory that is directly accessible by a CUDA kernel executing on the device. The memory allocated cannot be used for any operations other than querying properties using cuMemGetAllocationPropertiesFromHandle. In order to make this memory accessible, applications must map this memory into a VA range reserved by cuMemAddressReserve and provide suitable access rights to it. Applications must free the allocated memory using the cuMemRelease API.

10.3.1. Shareable Memory Allocations

With cuMemCreate, users now have the facility to indicate to CUDA, at allocation time, that they have earmarked a particular allocation for interprocess communication and graphics interop purposes. Applications can do this by setting CUmemAllocationProp::requestedHandleTypes to a platform-specific handle type. On Windows, when CUmemAllocationProp::requestedHandleTypes is set to CU_MEM_HANDLE_TYPE_WIN32, applications must also specify an LPSECURITYATTRIBUTES attribute in CUmemAllocationProp::win32HandleMetaData. This security attribute defines the scope within which exported allocations may be transferred to other processes.

The CUDA Virtual Memory Management API functions do not support the legacy interprocess communication functions with their memory. Instead, they expose a new mechanism for interprocess communication that uses OS-specific handles. Applications can obtain these OS-specific handles corresponding to the allocations by using cuMemExportToShareableHandle. The handles thus obtained can be transferred by using the usual OS native mechanisms for inter process communication. The recipient process should import the allocation by using cuMemImportFromShareableHandle.

Users must ensure they query for support of the requested handle type before attempting to export memory allocated with cuMemCreate. The following code snippet illustrates query for handle type support in a platform-specific way.

int deviceSupportsIpcHandle;
#if defined(__linux__)
    cuDeviceGetAttribute(&deviceSupportsIpcHandle, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR_SUPPORTED, device);
#else
    cuDeviceGetAttribute(&deviceSupportsIpcHandle, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_WIN32_HANDLE_SUPPORTED, device);
#endif

Users should set the CUmemAllocationProp::requestedHandleTypes appropriately as shown below:

#if defined(__linux__)
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
#else
    prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_WIN32;
    prop.win32HandleMetaData = // Windows specific LPSECURITYATTRIBUTES attribute.
#endif

The memMapIpcDrv sample can be used as an example for using IPC with Virtual Memory Management allocations.

10.3.2. Memory Type

Before CUDA 10.2, applications had no user-controlled way of allocating any special type of memory that certain devices may support. With cuMemCreate, applications can additionally specify memory type requirements using the CUmemAllocationProp::allocFlags to opt into any specific memory features. Applications must also ensure that the requested memory type is supported on the device of allocation.

10.3.2.1. Compressible Memory

Compressible memory can be used to accelerate accesses to data with unstructured sparsity and other compressible data patterns. Compression can save DRAM bandwidth, L2 read bandwidth, and L2 capacity depending on the data being operated on. Applications that want to allocate compressible memory on devices that support Compute Data Compression can do so by setting CUmemAllocationProp::allocFlags::compressionType to CU_MEM_ALLOCATION_COMP_GENERIC. Users must query whether the device supports Compute Data Compression by using CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED. The following code snippet illustrates querying compressible memory support with cuDeviceGetAttribute:

int compressionSupported = 0;
cuDeviceGetAttribute(&compressionSupported, CU_DEVICE_ATTRIBUTE_GENERIC_COMPRESSION_SUPPORTED, device);

On devices that support Compute Data Compression, users must opt in at allocation time as shown below:

prop.allocFlags.compressionType = CU_MEM_ALLOCATION_COMP_GENERIC;

Due to various reasons, such as limited hardware resources, the allocation may not have compression attributes. The user is expected to query back the properties of the allocated memory using cuMemGetAllocationPropertiesFromHandle and check for the compression attribute:

CUmemAllocationProp allocationProp = {};
cuMemGetAllocationPropertiesFromHandle(&allocationProp, allocationHandle);

if (allocationProp.allocFlags.compressionType == CU_MEM_ALLOCATION_COMP_GENERIC)
{
    // Obtained compressible memory allocation
}

10.4. Reserving a Virtual Address Range

Since with Virtual Memory Management the notions of address and memory are distinct, applications must carve out an address range that can hold the memory allocations made by cuMemCreate. The address range reserved must be at least as large as the sum of the sizes of all the physical memory allocations the user plans to place in them.
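For example, the minimum range size can be computed by padding each planned allocation to its granularity (as reported by cuMemGetAllocationGranularity) and summing; `minimumReservationSize` below is a hypothetical helper for illustration, not a CUDA API:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: given (size, granularity) pairs for the physical
// allocations an application plans to map into a single VA range, compute
// the minimum size to pass to cuMemAddressReserve. Each size is rounded up
// to its allocation granularity, as cuMemCreate requires.
size_t minimumReservationSize(const std::vector<std::pair<size_t, size_t>>& allocs) {
    size_t total = 0;
    for (const auto& [size, granularity] : allocs) {
        total += (size + granularity - 1) / granularity * granularity;  // round up
    }
    return total;
}
```

With two planned allocations of 1000 bytes (granularity 4096) and 5000 bytes (granularity 65536), the range reserved must be at least 4096 + 65536 bytes.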

Applications can reserve a virtual address range by passing appropriate parameters to cuMemAddressReserve. The address range obtained will not have any device or host physical memory associated with it. The reserved virtual address range can be mapped to memory chunks belonging to any device in the system, thus providing the application a contiguous VA range backed and mapped by memory belonging to different devices. Applications are expected to return the virtual address range back to CUDA using cuMemAddressFree. Users must ensure that the entire VA range is unmapped before calling cuMemAddressFree. These functions are conceptually similar to mmap/munmap (on Linux) or VirtualAlloc/VirtualFree (on Windows). The following code snippet illustrates their usage:

CUdeviceptr ptr;
// `ptr` holds the returned start of virtual address range reserved.
CUresult result = cuMemAddressReserve(&ptr, size, 0, 0, 0); // alignment = 0 for default alignment
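
The snippet above only shows the reservation. As a sketch of the teardown described in the preceding paragraph (assuming the same `ptr` and `size`, and that everything mapped into the range has already been unmapped with cuMemUnmap):

// ... use the range via cuMemMap / cuMemSetAccess / cuMemUnmap ...

// Once the entire VA range is unmapped, return it to CUDA.
cuMemAddressFree(ptr, size);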

10.5. Virtual Aliasing Support

The Virtual Memory Management APIs provide a way to create multiple virtual memory mappings or “proxies” to the same allocation using multiple calls to cuMemMap with different virtual addresses, so-called virtual aliasing. Unless otherwise noted in the PTX ISA, writes to one proxy of the allocation are considered inconsistent and incoherent with any other proxy of the same memory until the writing device operation (grid launch, memcpy, memset, and so on) completes. Grids present on the GPU prior to a writing device operation but reading after the writing device operation completes are also considered to have inconsistent and incoherent proxies.

For example, the following snippet is considered undefined, assuming device pointers A and B are virtual aliases of the same memory allocation:

__global__ void foo(char *A, char *B) {
  *A = 0x1;
  printf("%d\n", *B);    // Undefined behavior!  *B can take on either
                         // the previous value or some value in-between.
}

The following is defined behavior, assuming these two kernels are ordered monotonically (by streams or events).

__global__ void foo1(char *A) {
  *A = 0x1;
}

__global__ void foo2(char *B) {
  printf("%d\n", *B);    // *B == *A == 0x1 assuming foo2 waits for foo1
                         // to complete before launching
}

cudaMemcpyAsync(B, input, size, stream1);    // Aliases are allowed at
// operation boundaries
foo1<<<1,1,0,stream1>>>(A);                  // allowing foo1 to access A.
cudaEventRecord(event, stream1);
cudaStreamWaitEvent(stream2, event);
foo2<<<1,1,0,stream2>>>(B);
cudaStreamWaitEvent(stream3, event);
cudaMemcpyAsync(output, B, size, stream3);  // Both launches of foo2 and
                                            // cudaMemcpy (which both
                                            // read) wait for foo1 (which writes)
                                            // to complete before proceeding

10.6. Mapping Memory

The allocated physical memory and the carved-out virtual address space from the previous two sections represent the memory and address distinction introduced by the Virtual Memory Management APIs. For the allocated memory to be usable, the user must first place the memory in the address space. The address range obtained from cuMemAddressReserve and the physical allocation obtained from cuMemCreate or cuMemImportFromShareableHandle must be associated with each other by using cuMemMap.

Users can associate allocations from multiple devices to reside in contiguous virtual address ranges as long as they have carved out enough address space. In order to decouple the physical allocation and the address range, users must unmap the address of the mapping by using cuMemUnmap. Users can map and unmap memory to the same address range as many times as they want, as long as they ensure that they don’t attempt to create mappings on VA range reservations that are already mapped. The following code snippet illustrates the usage for the function:

CUdeviceptr ptr;
// `ptr`: address in the address range previously reserved by cuMemAddressReserve.
// `allocHandle`: CUmemGenericAllocationHandle obtained by a previous call to cuMemCreate.
CUresult result = cuMemMap(ptr, size, 0, allocHandle, 0);

10.7. Controlling Access Rights

The Virtual Memory Management APIs enable applications to explicitly protect their VA ranges with access control mechanisms. Mapping the allocation to a region of the address range using cuMemMap does not make the address accessible, and would result in a program crash if accessed by a CUDA kernel. Users must specifically select access control using the cuMemSetAccess function, which allows or restricts access for specific devices to a mapped address range. The following code snippet illustrates the usage for the function:

void setAccessOnDevice(int device, CUdeviceptr ptr, size_t size) {
    CUmemAccessDesc accessDesc = {};
    accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    accessDesc.location.id = device;
    accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    // Make the address accessible
    cuMemSetAccess(ptr, size, &accessDesc, 1);
}

The access control mechanism exposed with Virtual Memory Management allows users to be explicit about which allocations they want to share with other peer devices on the system. As specified earlier, cudaEnablePeerAccess forces all prior and future cudaMalloc'd allocations to be mapped to the target peer device. This can be convenient in many cases, as the user doesn't have to track the mapping state of every allocation to every device in the system, but it has performance implications for applications sensitive to mapping overhead. With access control at allocation granularity, Virtual Memory Management exposes a mechanism for creating peer mappings with minimal overhead.

The vectorAddMMAP sample can be used as an example for using the Virtual Memory Management APIs.

11. Stream Ordered Memory Allocator

11.1. Introduction

Managing memory allocations using cudaMalloc and cudaFree causes the GPU to synchronize across all executing CUDA streams. The Stream Ordered Memory Allocator enables applications to order memory allocation and deallocation with other work launched into a CUDA stream, such as kernel launches and asynchronous copies. This improves application memory use by taking advantage of stream-ordering semantics to reuse memory allocations. The allocator also allows applications to control its memory caching behavior. When set up with an appropriate release threshold, the caching behavior allows the allocator to avoid expensive calls into the OS when the application indicates it is willing to accept a bigger memory footprint. The allocator also supports the easy and secure sharing of allocations between processes.

For many applications, the Stream Ordered Memory Allocator reduces the need for custom memory management abstractions, and makes it easier to create high-performance custom memory management for applications that need it. For applications and libraries that already have custom memory allocators, adopting the Stream Ordered Memory Allocator enables multiple libraries to share a common pool of memory managed by the driver, thus reducing excess memory consumption. Additionally, the driver can perform optimizations based on its awareness of the allocator and other stream management APIs. Finally, Nsight Compute and the Next-Gen CUDA debugger are aware of the allocator as part of their CUDA 11.3 toolkit support.

11.2. Query for Support

The user can determine whether or not a device supports the stream ordered memory allocator by calling cudaDeviceGetAttribute() with the device attribute cudaDevAttrMemoryPoolsSupported.

Starting with CUDA 11.3, IPC memory pool support can be queried with the cudaDevAttrMemoryPoolSupportedHandleTypes device attribute. Previous drivers will return cudaErrorInvalidValue as those drivers are unaware of the attribute enum.

int driverVersion = 0;
int deviceSupportsMemoryPools = 0;
int poolSupportedHandleTypes = 0;
cudaDriverGetVersion(&driverVersion);
if (driverVersion >= 11020) {
    cudaDeviceGetAttribute(&deviceSupportsMemoryPools,
                           cudaDevAttrMemoryPoolsSupported, device);
}
if (deviceSupportsMemoryPools != 0) {
    // `device` supports the Stream Ordered Memory Allocator
}

if (driverVersion >= 11030) {
    cudaDeviceGetAttribute(&poolSupportedHandleTypes,
              cudaDevAttrMemoryPoolSupportedHandleTypes, device);
}
if (poolSupportedHandleTypes & cudaMemHandleTypePosixFileDescriptor) {
   // Pools on the specified device can be created with posix file descriptor-based IPC
}

Performing the driver version check before the query avoids hitting a cudaErrorInvalidValue error on drivers where the attribute was not yet defined. One can use cudaGetLastError to clear the error instead of avoiding it.
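
The alternative mentioned above can be sketched as follows (a minimal illustration, assuming a valid `device` index; error handling beyond the attribute query is omitted):

int poolSupportedHandleTypes = 0;
cudaError_t status = cudaDeviceGetAttribute(&poolSupportedHandleTypes,
        cudaDevAttrMemoryPoolSupportedHandleTypes, device);
if (status == cudaErrorInvalidValue) {
    // Older driver: the attribute enum is unknown. Clear the error state
    // and treat the device as having no IPC-capable pool support.
    cudaGetLastError();
    poolSupportedHandleTypes = 0;
}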

11.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)

The APIs cudaMallocAsync and cudaFreeAsync form the core of the allocator. cudaMallocAsync returns an allocation and cudaFreeAsync frees an allocation. Both APIs accept stream arguments that define when the allocation becomes available for use and when it stops being available. The pointer value returned by cudaMallocAsync is determined synchronously and is available for constructing future work. It is important to note that cudaMallocAsync ignores the current device/context when determining where the allocation will reside. Instead, cudaMallocAsync determines the resident device based on the specified memory pool or the supplied stream. The simplest use pattern is when the memory is allocated, used, and freed back into the same stream.

void *ptr;
size_t size = 512;
cudaMallocAsync(&ptr, size, cudaStreamPerThread);
// do work using the allocation
kernel<<<..., cudaStreamPerThread>>>(ptr, ...);
// An asynchronous free can be specified without synchronizing the cpu and GPU
cudaFreeAsync(ptr, cudaStreamPerThread);

When using an allocation in a stream other than the allocating stream, the user must guarantee that the access will happen after the allocation operation, otherwise the behavior is undefined. The user may make this guarantee either by synchronizing the allocating stream, or by using CUDA events to synchronize the producing and consuming streams.

cudaFreeAsync() inserts a free operation into the stream. The user must guarantee that the free operation happens after the allocation operation and any use of the allocation. Also, any use of the allocation after the free operation starts results in undefined behavior. Events and/or stream synchronizing operations should be used to guarantee any access to the allocation on other streams is complete before the freeing stream begins the free operation.

cudaMallocAsync(&ptr, size, stream1);
cudaEventRecord(event1, stream1);
//stream2 must wait for the allocation to be ready before accessing
cudaStreamWaitEvent(stream2, event1);
kernel<<<..., stream2>>>(ptr, ...);
cudaEventRecord(event2, stream2);
// stream3 must wait for stream2 to finish accessing the allocation before
// freeing the allocation
cudaStreamWaitEvent(stream3, event2);
cudaFreeAsync(ptr, stream3);

The user can free allocations made with cudaMalloc() using cudaFreeAsync(). The user must make the same guarantees about accesses being complete before the free operation begins.

cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr, ...);
cudaFreeAsync(ptr, stream);

The user can free memory allocated with cudaMallocAsync with cudaFree(). When freeing such allocations through the cudaFree() API, the driver assumes that all accesses to the allocation are complete and performs no further synchronization. The user can use cudaStreamQuery / cudaStreamSynchronize / cudaEventQuery / cudaEventSynchronize / cudaDeviceSynchronize to guarantee that the appropriate asynchronous work is complete and that the GPU will not try to access the allocation.

cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr, ...);
// synchronize is needed to avoid prematurely freeing the memory
cudaStreamSynchronize(stream);
cudaFree(ptr);

11.4. Memory Pools and the cudaMemPool_t

Memory pools encapsulate virtual address and physical memory resources that are allocated and managed according to the pool's attributes and properties. The primary aspect of a memory pool is the kind and location of memory it manages.

All calls to cudaMallocAsync use the resources of a memory pool. In the absence of a specified memory pool, cudaMallocAsync uses the current memory pool of the supplied stream's device. The current memory pool for a device may be set with cudaDeviceSetMempool and queried with cudaDeviceGetMempool. By default (in the absence of a cudaDeviceSetMempool call), the current memory pool is the default memory pool of a device. The API cudaMallocFromPoolAsync and C++ overloads of cudaMallocAsync allow a user to specify the pool to be used for an allocation without setting it as the current pool. The APIs cudaDeviceGetDefaultMempool and cudaMemPoolCreate give users handles to memory pools.

Note

The mempool current to a device will be local to that device. So allocating without specifying a memory pool will always yield an allocation local to the stream’s device.

Note

cudaMemPoolSetAttribute and cudaMemPoolGetAttribute control the attributes of the memory pools.
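
As an illustrative sketch of the pool-selection APIs described above (names such as `memPool`, `device`, and `stream` are assumed from the surrounding context; error checking is omitted):

// Allocate from an explicit pool without changing the device's current pool.
cudaMallocFromPoolAsync(&ptr, size, memPool, stream);

// Or make `memPool` the current pool for `device`; subsequent
// cudaMallocAsync calls on streams of this device will then use it.
cudaDeviceSetMempool(device, memPool);
cudaMallocAsync(&ptr2, size, stream);

// The current pool can be queried back at any time.
cudaMemPool_t currentPool;
cudaDeviceGetMempool(&currentPool, device);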

11.5. Default/Implicit Pools

The default memory pool of a device may be retrieved with the cudaDeviceGetDefaultMempool API. Allocations from the default memory pool of a device are non-migratable device allocations located on that device. These allocations will always be accessible from that device. The accessibility of the default memory pool may be modified with cudaMemPoolSetAccess and queried with cudaMemPoolGetAccess. Since the default pools do not need to be explicitly created, they are sometimes referred to as implicit pools. The default memory pool of a device does not support IPC.

11.6. Explicit Pools

The API cudaMemPoolCreate creates an explicit pool. This allows applications to request properties for their allocations beyond what is provided by the default/implicit pools, such as IPC capability, maximum pool size, and (on supported platforms) allocations resident on a specific CPU NUMA node.

// create a pool similar to the implicit pool on device 0
int device = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = device;
poolProps.location.type = cudaMemLocationTypeDevice;

cudaMemPool_t memPool;
cudaMemPoolCreate(&memPool, &poolProps);

The following code snippet illustrates an example of creating an IPC capable memory pool on a valid CPU NUMA node.

// create a pool resident on a CPU NUMA node that is capable of IPC sharing (via a file descriptor).
int cpu_numa_id = 0;
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = cpu_numa_id;
poolProps.location.type = cudaMemLocationTypeHostNuma;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;

cudaMemPool_t ipcMemPool;
cudaMemPoolCreate(&ipcMemPool, &poolProps);

11.7. Physical Page Caching Behavior

By default, the allocator tries to minimize the physical memory owned by a pool. To minimize the OS calls to allocate and free physical memory, applications must configure a memory footprint for each pool. Applications can do this with the release threshold attribute (cudaMemPoolAttrReleaseThreshold).

The release threshold is the amount of memory in bytes a pool should hold onto before trying to release memory back to the OS. When more than the release threshold bytes of memory are held by the memory pool, the allocator will try to release memory back to the OS on the next call to stream, event or device synchronize. Setting the release threshold to UINT64_MAX will prevent the driver from attempting to shrink the pool after every synchronization.

cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);

Applications that set cudaMemPoolAttrReleaseThreshold high enough to effectively disable memory pool shrinking may wish to explicitly shrink a memory pool’s memory footprint. cudaMemPoolTrimTo allows such applications to do so. When trimming a memory pool’s footprint, the minBytesToKeep parameter allows an application to hold onto an amount of memory it expects to need in a subsequent phase of execution.
cudaMemPoolAttrReleaseThreshold 设置得足够高,以有效地禁用内存池收缩的应用程序可能希望显式地减小内存池的内存占用。 cudaMemPoolTrimTo 允许这样的应用程序这样做。在修剪内存池的占用空间时, minBytesToKeep 参数允许应用程序保留它预计在执行的后续阶段中需要的内存量。

cuuint64_t setVal = UINT64_MAX;
cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReleaseThreshold, &setVal);

// application phase needing a lot of memory from the stream ordered allocator
for (int i = 0; i < 10; i++) {
    for (int j = 0; j < 10; j++) {
        cudaMallocAsync(&ptrs[j], size[j], stream);
    }
    kernel<<<...,stream>>>(ptrs,...);
    for (int j = 0; j < 10; j++) {
        cudaFreeAsync(ptrs[j], stream);
    }
}

// Process does not need as much memory for the next phase.
// Synchronize so that the trim operation will know that the allocations are no
// longer in use.
cudaStreamSynchronize(stream);
cudaMemPoolTrimTo(memPool, 0);

// Some other process/allocation mechanism can now use the physical memory
// released by the trimming operation.

11.8. Resource Usage Statistics

In CUDA 11.3, the pool attributes cudaMemPoolAttrReservedMemCurrent, cudaMemPoolAttrReservedMemHigh, cudaMemPoolAttrUsedMemCurrent, and cudaMemPoolAttrUsedMemHigh were added to query the memory usage of a pool.

Querying the cudaMemPoolAttrReservedMemCurrent attribute of a pool reports the current total physical GPU memory consumed by the pool. Querying the cudaMemPoolAttrUsedMemCurrent of a pool returns the total size of all of the memory allocated from the pool and not available for reuse.

The cudaMemPoolAttr*MemHigh attributes are watermarks recording the maximum value reached by the respective cudaMemPoolAttr*MemCurrent attribute since the last reset. They can be reset to the current value using the cudaMemPoolSetAttribute API.

// sample helper functions for getting the usage statistics in bulk
struct usageStatistics {
    cuuint64_t reserved;
    cuuint64_t reservedHigh;
    cuuint64_t used;
    cuuint64_t usedHigh;
};

void getUsageStatistics(cudaMemPool_t memPool, struct usageStatistics *statistics)
{
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemCurrent, &statistics->reserved);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &statistics->reservedHigh);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemCurrent, &statistics->used);
    cudaMemPoolGetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &statistics->usedHigh);
}


// resetting the watermarks will make them take on the current value.
void resetStatistics(cudaMemPool_t memPool)
{
    cuuint64_t value = 0;
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrReservedMemHigh, &value);
    cudaMemPoolSetAttribute(memPool, cudaMemPoolAttrUsedMemHigh, &value);
}

11.9. Memory Reuse Policies

In order to service an allocation request, the driver attempts to reuse memory that was previously freed via cudaFreeAsync() before attempting to allocate more memory from the OS. For example, memory freed in a stream can immediately be reused for a subsequent allocation request in the same stream. Similarly, when a stream is synchronized with the CPU, the memory that was previously freed in that stream becomes available for reuse for an allocation in any stream.
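
The same-stream reuse described above can be sketched as follows (a minimal illustration; `stream`, `size`, and the kernel names are assumed from context):

cudaMallocAsync(&ptr, size, stream);
kernel1<<<..., stream>>>(ptr, ...);
cudaFreeAsync(ptr, stream);
// A subsequent allocation request of a compatible size in the same
// stream may be serviced by the memory freed above, in stream order,
// without a new request to the OS.
cudaMallocAsync(&ptr2, size, stream);
kernel2<<<..., stream>>>(ptr2, ...);
cudaFreeAsync(ptr2, stream);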

The stream ordered allocator has a few controllable allocation policies. The pool attributes cudaMemPoolReuseFollowEventDependencies, cudaMemPoolReuseAllowOpportunistic, and cudaMemPoolReuseAllowInternalDependencies control these policies. Upgrading to a newer CUDA driver may change, enhance, augment and/or reorder the reuse policies.

11.9.1. cudaMemPoolReuseFollowEventDependencies

Before allocating more physical GPU memory, the allocator examines dependency information established by CUDA events and tries to allocate from memory freed in another stream.

cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);
cudaEventRecord(event,originalStream);

// waiting on the event that captures the free in another stream
// allows the allocator to reuse the memory to satisfy
// a new allocation request in the other stream when
// cudaMemPoolReuseFollowEventDependencies is enabled.
cudaStreamWaitEvent(otherStream, event);
cudaMallocAsync(&ptr2, size, otherStream);

11.9.2. cudaMemPoolReuseAllowOpportunistic

According to the cudaMemPoolReuseAllowOpportunistic policy, the allocator examines freed allocations to see if the free's stream-order semantics have been met (for example, the stream has passed the point of execution indicated by the free). When this is disabled, the allocator will still reuse memory made available when a stream is synchronized with the CPU. Disabling this policy does not prevent cudaMemPoolReuseFollowEventDependencies from applying.

cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);


// after some time, the kernel finishes running
wait(10);

// When cudaMemPoolReuseAllowOpportunistic is enabled this allocation request
// can be fulfilled with the prior allocation based on the progress of originalStream.
cudaMallocAsync(&ptr2, size, otherStream);

11.9.3. cudaMemPoolReuseAllowInternalDependencies

If it fails to allocate and map more physical memory from the OS, the driver will look for memory whose availability depends on another stream's pending progress. If such memory is found, the driver will insert the required dependency into the allocating stream and reuse the memory.

cudaMallocAsync(&ptr, size, originalStream);
kernel<<<..., originalStream>>>(ptr, ...);
cudaFreeAsync(ptr, originalStream);

// When cudaMemPoolReuseAllowInternalDependencies is enabled
// and the driver fails to allocate more physical memory, the driver may
// effectively perform a cudaStreamWaitEvent in the allocating stream
// to make sure that future work in ‘otherStream’ happens after the work
// in the original stream that would be allowed to access the original allocation.
cudaMallocAsync(&ptr2, size, otherStream);

11.9.4. Disabling Reuse Policies

While the controllable reuse policies improve memory reuse, users may want to disable them. Allowing opportunistic reuse (such as cudaMemPoolReuseAllowOpportunistic) introduces run-to-run variance in allocation patterns based on the interleaving of CPU and GPU execution. Internal dependency insertion (such as cudaMemPoolReuseAllowInternalDependencies) can serialize work in unexpected and potentially non-deterministic ways when the user would rather explicitly synchronize an event or stream on allocation failure.

11.10. Device Accessibility for Multi-GPU Support

Just like allocation accessibility controlled through the virtual memory management APIs, memory pool allocation accessibility does not follow cudaDeviceEnablePeerAccess or cuCtxEnablePeerAccess. Instead, the API cudaMemPoolSetAccess modifies what devices can access allocations from a pool. By default, allocations are accessible from the device where the allocations are located. This access cannot be revoked. To enable access from other devices, the accessing device must be peer capable with the memory pool’s device; check with cudaDeviceCanAccessPeer. If the peer capability is not checked, the set access may fail with cudaErrorInvalidDevice. If no allocations had been made from the pool, the cudaMemPoolSetAccess call may succeed even when the devices are not peer capable; in this case, the next allocation from the pool will fail.

It is worth noting that cudaMemPoolSetAccess affects all allocations from the memory pool, not just future ones. Also the accessibility reported by cudaMemPoolGetAccess applies to all allocations from the pool, not just future ones. It is recommended that the accessibility settings of a pool for a given GPU not be changed frequently; once a pool is made accessible from a given GPU, it should remain accessible from that GPU for the lifetime of the pool.

// snippet showing usage of cudaMemPoolSetAccess:
cudaError_t setAccessOnDevice(cudaMemPool_t memPool, int residentDevice,
              int accessingDevice) {
    cudaMemAccessDesc accessDesc = {};
    accessDesc.location.type = cudaMemLocationTypeDevice;
    accessDesc.location.id = accessingDevice;
    accessDesc.flags = cudaMemAccessFlagsProtReadWrite;

    int canAccess = 0;
    cudaError_t error = cudaDeviceCanAccessPeer(&canAccess, accessingDevice,
              residentDevice);
    if (error != cudaSuccess) {
        return error;
    } else if (canAccess == 0) {
        return cudaErrorPeerAccessUnsupported;
    }

    // Make the address accessible
    return cudaMemPoolSetAccess(memPool, &accessDesc, 1);
}

11.11. IPC Memory Pools

IPC capable memory pools allow easy, efficient and secure sharing of GPU memory between processes. CUDA’s IPC memory pools provide the same security benefits as CUDA’s virtual memory management APIs.

There are two phases to sharing memory between processes with memory pools. The processes first need to share access to the pool, then share specific allocations from that pool. The first phase establishes and enforces security. The second phase coordinates what virtual addresses are used in each process and when mappings need to be valid in the importing process.

11.11.1. Creating and Sharing IPC Memory Pools

Sharing access to a pool involves retrieving an OS native handle to the pool (with the cudaMemPoolExportToShareableHandle() API), transferring the handle to the importing process using the usual OS native IPC mechanisms, and creating an imported memory pool (with the cudaMemPoolImportFromShareableHandle() API). For cudaMemPoolExportToShareableHandle to succeed, the memory pool must have been created with the requested handle type specified in the pool properties structure. Refer to the samples for the appropriate IPC mechanisms to transfer the OS native handle between processes. The rest of the procedure can be found in the following code snippets.

// in exporting process
// create an exportable IPC capable pool on device 0
cudaMemPoolProps poolProps = { };
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.id = 0;
poolProps.location.type = cudaMemLocationTypeDevice;

// Setting handleTypes to a non-zero value will make the pool exportable (IPC capable)
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;

cudaMemPool_t memPool;
cudaMemPoolCreate(&memPool, &poolProps);

// FD based handles are integer types
int fdHandle = 0;


// Retrieve an OS native handle to the pool.
// Note that a pointer to the handle memory is passed in here.
cudaMemPoolExportToShareableHandle(&fdHandle,
             memPool,
             cudaMemHandleTypePosixFileDescriptor,
             0);

// The handle must be sent to the importing process with the appropriate
// OS specific APIs.

// in importing process
int fdHandle;
// The handle needs to be retrieved from the exporting process with the
// appropriate OS specific APIs.
// Create an imported pool from the shareable handle.
// Note that the handle is passed by value here.
cudaMemPoolImportFromShareableHandle(&importedMemPool,
          (void*)fdHandle,
          cudaMemHandleTypePosixFileDescriptor,
          0);
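The snippets above leave the OS native transfer of the handle to the CUDA samples. On Linux, a POSIX file descriptor is typically passed between processes as SCM_RIGHTS ancillary data over a Unix-domain socket. The following self-contained sketch shows that mechanism with no CUDA involved: a pipe's read end stands in for the fdHandle returned by cudaMemPoolExportToShareableHandle(), and the helper names are illustrative.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one open file descriptor over a Unix-domain socket as SCM_RIGHTS
 * ancillary data. */
static int send_fd(int sock, int fd) {
    char byte = 'x';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    memset(ctrl, 0, sizeof(ctrl));
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; the kernel installs a fresh descriptor that
 * refers to the same open file description. */
static int recv_fd(int sock) {
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof(ctrl);
    if (recvmsg(sock, &msg, 0) != 1) return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_type != SCM_RIGHTS) return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Round-trip demo: share a pipe's read end the way a process would share
 * the fdHandle from cudaMemPoolExportToShareableHandle(). Returns 0 on
 * success. */
static int share_handle_demo(void) {
    int sv[2], p[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;
    if (pipe(p) != 0) return -1;
    if (send_fd(sv[0], p[0]) != 0) return -1;  /* "exporting" side */
    int imported = recv_fd(sv[1]);             /* "importing" side */
    if (imported < 0) return -1;
    /* Prove the imported descriptor refers to the same pipe. */
    char buf[12] = { 0 };
    if (write(p[1], "pool-handle", 11) != 11) return -1;
    if (read(imported, buf, 11) != 11) return -1;
    return strcmp(buf, "pool-handle") == 0 ? 0 : -1;
}
```

In a real application, the descriptor returned by recv_fd() would be handed to cudaMemPoolImportFromShareableHandle() in the importing process.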

11.11.2. Set Access in the Importing Process

Imported memory pools are initially only accessible from their resident device. The imported memory pool does not inherit any accessibility set by the exporting process. The importing process needs to enable access (with cudaMemPoolSetAccess) from any GPU it plans to access the memory from.

If the imported memory pool belongs to a non-visible device in the importing process, the user must use the cudaMemPoolSetAccess API to enable access from the GPUs the allocations will be used on.
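As a sketch, enabling read-write access from the importing process's device might look like the following; the importedMemPool and importingDevice names are illustrative, not from the snippets above:

```cpp
cudaMemAccessDesc accessDesc = {};
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id   = importingDevice;  // device the importing process accesses from
accessDesc.flags         = cudaMemAccessFlagsProtReadWrite;

// importedMemPool came from cudaMemPoolImportFromShareableHandle()
cudaMemPoolSetAccess(importedMemPool, &accessDesc, 1);
```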

11.11.3. Creating and Sharing Allocations from an Exported Pool

Once the pool has been shared, allocations made with cudaMallocAsync() from the pool in the exporting process can be shared with other processes that have imported the pool. Since the pool’s security policy is established and verified at the pool level, the OS does not need extra bookkeeping to provide security for specific pool allocations; in other words, the opaque cudaMemPoolPtrExportData required to import a pool allocation may be sent to the importing process using any mechanism.

While allocations may be exported and even imported without synchronizing with the allocating stream in any way, the importing process must follow the same rules as the exporting process when accessing the allocation. Namely, access to the allocation must happen after the stream ordering of the allocation operation in the allocating stream. The two following code snippets show cudaMemPoolExportPointer() and cudaMemPoolImportPointer() sharing the allocation with an IPC event used to guarantee that the allocation isn’t accessed in the importing process before the allocation is ready.

// preparing an allocation in the exporting process
cudaMemPoolPtrExportData exportData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t readyIpcEventHandle;

// ipc event for coordinating between processes
// cudaEventInterprocess flag makes the event an ipc event
// cudaEventDisableTiming  is set for performance reasons

cudaEventCreateWithFlags(
        &readyIpcEvent, cudaEventDisableTiming | cudaEventInterprocess);

// allocate from the exporting mem pool
cudaMallocAsync(&ptr, size, exportMemPool, stream);

// event for sharing when the allocation is ready.
cudaEventRecord(readyIpcEvent, stream);
cudaMemPoolExportPointer(&exportData, ptr);
cudaIpcGetEventHandle(&readyIpcEventHandle, readyIpcEvent);

// Share IPC event and pointer export data with the importing process using
//  any mechanism. Here we copy the data into shared memory
shmem->ptrData = exportData;
shmem->readyIpcEventHandle = readyIpcEventHandle;
// signal consumers data is ready

// Importing an allocation
cudaMemPoolPtrExportData *importData = &shmem->ptrData;
cudaEvent_t readyIpcEvent;
cudaIpcEventHandle_t *readyIpcEventHandle = &shmem->readyIpcEventHandle;

// Need to retrieve the ipc event handle and the export data from the
// exporting process using any mechanism.  Here we are using shmem and just
// need synchronization to make sure the shared memory is filled in.

cudaIpcOpenEventHandle(&readyIpcEvent, readyIpcEventHandle);

// import the allocation. The operation does not block on the allocation being ready.
cudaMemPoolImportPointer(&ptr, importedMemPool, importData);

// Wait for the prior stream operations in the allocating stream to complete before
// using the allocation in the importing process.
cudaStreamWaitEvent(stream, readyIpcEvent);
kernel<<<..., stream>>>(ptr, ...);

When freeing the allocation, the allocation needs to be freed in the importing process before it is freed in the exporting process. The following code snippet demonstrates the use of a CUDA IPC event to provide the required synchronization between the cudaFreeAsync operations in the two processes. Access to the allocation from the importing process is naturally bounded by the free operation on the importing side. It is worth noting that cudaFree can be used to free the allocation in both processes, and that other stream synchronization APIs may be used instead of CUDA IPC events.

// The free must happen in importing process before the exporting process
kernel<<<..., stream>>>(ptr, ...);

// Last access in importing process
cudaFreeAsync(ptr, stream);

// Access not allowed in the importing process after the free
cudaEventRecord(finishedIpcEvent, stream);
// Exporting process
// The exporting process needs to coordinate its free with the stream order
// of the importing process’s free.
cudaStreamWaitEvent(stream, finishedIpcEvent);
kernel<<<..., stream>>>(ptrInExportingProcess, ...);

// The free in the importing process doesn’t stop the exporting process
// from using the allocation.
cudaFreeAsync(ptrInExportingProcess, stream);

11.11.4. IPC Export Pool Limitations

IPC pools currently do not support releasing physical blocks back to the OS. As a result, the cudaMemPoolTrimTo API acts as a no-op and the cudaMemPoolAttrReleaseThreshold is effectively ignored. This behavior is controlled by the driver, not the runtime, and may change in a future driver update.

11.11.5. IPC Import Pool Limitations

Allocating from an import pool is not allowed; specifically, import pools cannot be set current and cannot be used in the cudaMallocFromPoolAsync API. As such, the allocation reuse policy attributes are meaningless for these pools.

IPC pools currently do not support releasing physical blocks back to the OS. As a result, the cudaMemPoolTrimTo API acts as a no-op and the cudaMemPoolAttrReleaseThreshold is effectively ignored.

The resource usage stat attribute queries only reflect the allocations imported into the process and the associated physical memory.
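These statistics can be read with cudaMemPoolGetAttribute. A minimal sketch, with error handling omitted and the pool variable illustrative:

```cpp
cuuint64_t usedMem = 0;      // memory backing outstanding allocations from the pool
cuuint64_t reservedMem = 0;  // physical memory currently reserved by the pool

cudaMemPoolGetAttribute(importedMemPool, cudaMemPoolAttrUsedMemCurrent, &usedMem);
cudaMemPoolGetAttribute(importedMemPool, cudaMemPoolAttrReservedMemCurrent, &reservedMem);
```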

11.12. Synchronization API Actions

One of the optimizations that comes with the allocator being part of the CUDA driver is integration with the synchronization APIs. When the user requests that the CUDA driver synchronize, the driver waits for asynchronous work to complete. Before returning, the driver determines which free operations the synchronization guarantees to have completed. The corresponding allocations become available for reuse regardless of the specified stream or disabled allocation policies. The driver also checks cudaMemPoolAttrReleaseThreshold here and releases any excess physical memory that it can.

11.13. Addendums

11.13.1. cudaMemcpyAsync Current Context/Device Sensitivity

In the current CUDA driver, any async memcpy involving memory from cudaMallocAsync should be done using the specified stream’s context as the calling thread’s current context. This is not necessary for cudaMemcpyPeerAsync, as the device primary contexts specified in the API are referenced instead of the current context.
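As a sketch, assuming the runtime API (where setting the current device selects that device's primary context) and illustrative variable names:

```cpp
// ptr was allocated with cudaMallocAsync() on a stream created on device 0,
// so make device 0's primary context current before the copy.
cudaSetDevice(0);
cudaMemcpyAsync(hostBuf, ptr, size, cudaMemcpyDeviceToHost, stream);

// cudaMemcpyPeerAsync references the primary contexts of the devices
// named in the call, so the current device does not matter here.
cudaMemcpyPeerAsync(dstPtr, dstDevice, srcPtr, srcDevice, size, stream);
```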

11.13.2. cuPointerGetAttribute Query

Invoking cuPointerGetAttribute on an allocation after invoking cudaFreeAsync on it results in undefined behavior. Specifically, it does not matter if an allocation is still accessible from a given stream: the behavior is still undefined.

11.13.3. cuGraphAddMemsetNode

cuGraphAddMemsetNode does not work with memory allocated via the stream ordered allocator. However, memsets of the allocations can be stream captured.

11.13.4. Pointer Attributes

The cuPointerGetAttributes query works on stream ordered allocations. Since stream ordered allocations are not context associated, querying CU_POINTER_ATTRIBUTE_CONTEXT will succeed but return NULL in *data. The attribute CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL can be used to determine the location of the allocation: this can be useful when selecting a context for making p2h2p copies using cudaMemcpyPeerAsync. The attribute CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE was added in CUDA 11.3 and can be useful for debugging and for confirming which pool an allocation comes from before doing IPC.

12. Graph Memory Nodes

12.1. Introduction

Graph memory nodes allow graphs to create and own memory allocations. Graph memory nodes have GPU ordered lifetime semantics, which dictate when memory is allowed to be accessed on the device. These GPU ordered lifetime semantics enable driver-managed memory reuse, and match those of the stream ordered allocation APIs cudaMallocAsync and cudaFreeAsync, which may be captured when creating a graph.

Graph allocations have fixed addresses over the life of a graph including repeated instantiations and launches. This allows the memory to be directly referenced by other operations within the graph without the need of a graph update, even when CUDA changes the backing physical memory. Within a graph, allocations whose graph ordered lifetimes do not overlap may use the same underlying physical memory.

CUDA may reuse the same physical memory for allocations across multiple graphs, aliasing virtual address mappings according to the GPU ordered lifetime semantics. For example, when different graphs are launched into the same stream, CUDA may virtually alias the same physical memory to satisfy the needs of allocations which have single-graph lifetimes.

12.2. Support and Compatibility

Graph memory nodes require an 11.4 capable CUDA driver and support for the stream ordered allocator on the GPU. The following snippet shows how to check for support on a given device.

int driverVersion = 0;
int deviceSupportsMemoryPools = 0;
int deviceSupportsMemoryNodes = 0;
cudaDriverGetVersion(&driverVersion);
if (driverVersion >= 11020) { // avoid invalid value error in cudaDeviceGetAttribute
    cudaDeviceGetAttribute(&deviceSupportsMemoryPools, cudaDevAttrMemoryPoolsSupported, device);
}
deviceSupportsMemoryNodes = (driverVersion >= 11040) && (deviceSupportsMemoryPools != 0);

Doing the attribute query inside the driver version check avoids an invalid value return code on 11.0 and 11.1 drivers. Be aware that the compute sanitizer emits warnings when it detects CUDA returning error codes, and a version check before reading the attribute will avoid this. Graph memory nodes are only supported on driver versions 11.4 and newer.

12.3. API Fundamentals

Graph memory nodes are graph nodes representing either memory allocation or free actions. As a shorthand, nodes that allocate memory are called allocation nodes. Likewise, nodes that free memory are called free nodes. Allocations created by allocation nodes are called graph allocations. CUDA assigns virtual addresses for the graph allocation at node creation time. While these virtual addresses are fixed for the lifetime of the allocation node, the allocation contents are not persistent past the freeing operation and may be overwritten by accesses referring to a different allocation.

Graph allocations are considered recreated every time a graph runs. A graph allocation’s lifetime, which differs from the node’s lifetime, begins when GPU execution reaches the allocating graph node and ends when one of the following occurs:

  • GPU execution reaches the freeing graph node

  • GPU execution reaches the freeing cudaFreeAsync() stream call

  • immediately upon the freeing call to cudaFree()

Note

Graph destruction does not automatically free any live graph-allocated memory, even though it ends the lifetime of the allocation node. The allocation must subsequently be freed in another graph, or using cudaFreeAsync()/cudaFree().

Just like other graph nodes, graph memory nodes are ordered within a graph by dependency edges. A program must guarantee that operations accessing graph memory:

  • are ordered after the allocation node

  • are ordered before the operation freeing the memory

Graph allocation lifetimes begin and usually end according to GPU execution (as opposed to API invocation). GPU ordering is the order that work runs on the GPU as opposed to the order that the work is enqueued or described. Thus, graph allocations are considered ‘GPU ordered.’

12.3.1. Graph Node APIs

Graph memory nodes may be explicitly created with the memory node creation APIs, cudaGraphAddMemAllocNode and cudaGraphAddMemFreeNode. The address allocated by cudaGraphAddMemAllocNode is returned to the user in the dptr field of the passed CUDA_MEM_ALLOC_NODE_PARAMS structure. All operations using graph allocations inside the allocating graph must be ordered after the allocating node. Similarly, any free nodes must be ordered after all uses of the allocation within the graph. cudaGraphAddMemFreeNode creates free nodes.

In the following figure, there is an example graph with an alloc and a free node. Kernel nodes a, b, and c are ordered after the allocation node and before the free node such that the kernels can access the allocation. Kernel node e is not ordered after the alloc node and therefore cannot safely access the memory. Kernel node d is not ordered before the free node, therefore it cannot safely access the memory.


Figure 28 Kernel Nodes

The following code snippet establishes the graph in this figure:

// Create the graph - it starts out empty
cudaGraphCreate(&graph, 0);

// parameters for a basic allocation
cudaMemAllocNodeParams params = {};
params.poolProps.allocType = cudaMemAllocationTypePinned;
params.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 0 as the resident device
params.poolProps.location.id = 0;
params.bytesize = size;

cudaGraphAddMemAllocNode(&allocNode, graph, NULL, 0, &params);
nodeParams.kernelParams[0] = params.dptr;
cudaGraphAddKernelNode(&a, graph, &allocNode, 1, &nodeParams);
cudaGraphAddKernelNode(&b, graph, &a, 1, &nodeParams);
cudaGraphAddKernelNode(&c, graph, &a, 1, &nodeParams);
cudaGraphNode_t dependencies[2];
// kernel nodes b and c are using the graph allocation, so the freeing node must depend on them.  Since the dependency of node b on node a establishes an indirect dependency, the free node does not need to explicitly depend on node a.
dependencies[0] = b;
dependencies[1] = c;
cudaGraphAddMemFreeNode(&freeNode, graph, dependencies, 2, params.dptr);
// free node does not depend on kernel node d, so it must not access the freed graph allocation.
cudaGraphAddKernelNode(&d, graph, &c, 1, &nodeParams);

// node e does not depend on the allocation node, so it must not access the allocation.  This would be true even if the freeNode depended on kernel node e.
cudaGraphAddKernelNode(&e, graph, NULL, 0, &nodeParams);

12.3.2. Stream Capture

Graph memory nodes can be created by capturing the corresponding stream ordered allocation and free calls cudaMallocAsync and cudaFreeAsync. In this case, the virtual addresses returned by the captured allocation API can be used by other operations inside the graph. Since the stream ordered dependencies will be captured into the graph, the ordering requirements of the stream ordered allocation APIs guarantee that the graph memory nodes will be properly ordered with respect to the captured stream operations (for correctly written stream code).

Ignoring kernel nodes d and e, for clarity, the following code snippet shows how to use stream capture to create the graph from the previous figure:

cudaMallocAsync(&dptr, size, stream1);
kernel_A<<< ..., stream1 >>>(dptr, ...);

// Fork into stream2
cudaEventRecord(event1, stream1);
cudaStreamWaitEvent(stream2, event1);

kernel_B<<< ..., stream1 >>>(dptr, ...);
// event dependencies translated into graph dependencies, so the kernel node created by the capture of kernel C will depend on the allocation node created by capturing the cudaMallocAsync call.
kernel_C<<< ..., stream2 >>>(dptr, ...);

// Join stream2 back to origin stream (stream1)
cudaEventRecord(event2, stream2);
cudaStreamWaitEvent(stream1, event2);

// Free depends on all work accessing the memory.
cudaFreeAsync(dptr, stream1);

// End capture in the origin stream
cudaStreamEndCapture(stream1, &graph);

12.3.3. Accessing and Freeing Graph Memory Outside of the Allocating Graph

Graph allocations do not have to be freed by the allocating graph. When a graph does not free an allocation, that allocation persists beyond the execution of the graph and can be accessed by subsequent CUDA operations. These allocations may be accessed in another graph or directly using a stream operation as long as the accessing operation is ordered after the allocation through CUDA events and other stream ordering mechanisms. An allocation may subsequently be freed by regular calls to cudaFree, cudaFreeAsync, or by the launch of another graph with a corresponding free node, or a subsequent launch of the allocating graph (if it was instantiated with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag). It is illegal to access memory after it has been freed - the free operation must be ordered after all operations accessing the memory using graph dependencies, CUDA events, and other stream ordering mechanisms.

Note

Because graph allocations may share underlying physical memory with each other, the Virtual Aliasing Support rules relating to consistency and coherency must be considered. Simply put, the free operation must be ordered after the full device operation (for example, compute kernel / memcpy) completes. Specifically, out of band synchronization - for example a handshake through memory as part of a compute kernel that accesses the graph-allocated memory - is not sufficient for providing ordering guarantees between the memory writes to graph memory and the free operation of that graph memory.

The following code snippets demonstrate accessing graph allocations outside of the allocating graph with ordering properly established by: using a single stream, using events between streams, and using events baked into the allocating and freeing graph.

Ordering established by using a single stream:

void *dptr;
cudaGraphAddMemAllocNode(&allocNode, allocGraph, NULL, 0, &params);
dptr = params.dptr;

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, stream);
kernel<<< …, stream >>>(dptr, …);
cudaFreeAsync(dptr, stream);

Ordering established by recording and waiting on CUDA events:

void *dptr;

// Contents of allocating graph
cudaGraphAddMemAllocNode(&allocNode, allocGraph, NULL, 0, &params);
dptr = params.dptr;

// contents of consuming/freeing graph
nodeParams.kernelParams[0] = params.dptr;
cudaGraphAddKernelNode(&a, freeGraph, NULL, 0, &nodeParams);
cudaGraphAddMemFreeNode(&freeNode, freeGraph, &a, 1, dptr);

cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);
cudaGraphInstantiate(&freeGraphExec, freeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// establish the dependency of stream2 on the allocation node
// note: the dependency could also have been established with a stream synchronize operation
cudaEventRecord(allocEvent, allocStream);
cudaStreamWaitEvent(stream2, allocEvent);

kernel<<< …, stream2 >>> (dptr, …);

// establish the dependency between the stream 3 and the allocation use
cudaEventRecord(streamUseDoneEvent, stream2);
cudaStreamWaitEvent(stream3, streamUseDoneEvent);

// it is now safe to launch the freeing graph, which may also access the memory
cudaGraphLaunch(freeGraphExec, stream3);

Ordering established by using graph external event nodes:

void *dptr;
cudaEvent_t allocEvent; // event indicating when the allocation will be ready for use.
cudaEvent_t streamUseDoneEvent; // event indicating when the stream operations are done with the allocation.

// Contents of allocating graph with event record node
cudaGraphAddMemAllocNode(&allocNode, allocGraph, NULL, 0, &params);
dptr = params.dptr;
// note: this event record node depends on the alloc node
cudaGraphAddEventRecordNode(&recordNode, allocGraph, &allocNode, 1, allocEvent);
cudaGraphInstantiate(&allocGraphExec, allocGraph, NULL, NULL, 0);

// contents of consuming/freeing graph with event wait nodes
cudaGraphAddEventWaitNode(&streamUseDoneEventNode, waitAndFreeGraph, NULL, 0, streamUseDoneEvent);
cudaGraphAddEventWaitNode(&allocReadyEventNode, waitAndFreeGraph, NULL, 0, allocEvent);
nodeParams.kernelParams[0] = params.dptr;

// The allocReadyEventNode provides ordering with the alloc node for use in a consuming graph.
cudaGraphAddKernelNode(&kernelNode, waitAndFreeGraph, &allocReadyEventNode, 1, &nodeParams);

// The free node has to be ordered after both external and internal users.
// Thus the node must depend on both the kernelNode and the
// streamUseDoneEventNode.
dependencies[0] = kernelNode;
dependencies[1] = streamUseDoneEventNode;
cudaGraphAddMemFreeNode(&freeNode, waitAndFreeGraph, dependencies, 2, dptr);
cudaGraphInstantiate(&waitAndFreeGraphExec, waitAndFreeGraph, NULL, NULL, 0);

cudaGraphLaunch(allocGraphExec, allocStream);

// establish the dependency of stream2 on the event node satisfies the ordering requirement
cudaStreamWaitEvent(stream2, allocEvent);
kernel<<< …, stream2 >>> (dptr, …);
cudaEventRecord(streamUseDoneEvent, stream2);

// the event wait node on streamUseDoneEvent in the waitAndFreeGraphExec establishes the dependency that is needed to prevent the kernel running in stream2 from accessing the allocation after the free node in execution order.
cudaGraphLaunch(waitAndFreeGraphExec, stream3);

12.3.4. cudaGraphInstantiateFlagAutoFreeOnLaunch

Under normal circumstances, CUDA will prevent a graph from being relaunched if it has unfreed memory allocations because multiple allocations at the same address will leak memory. Instantiating a graph with the cudaGraphInstantiateFlagAutoFreeOnLaunch flag allows the graph to be relaunched while it still has unfreed allocations. In this case, the launch automatically inserts an asynchronous free of the unfreed allocations.

Auto free on launch is useful for single-producer multiple-consumer algorithms. At each iteration, a producer graph creates several allocations, and, depending on runtime conditions, a varying set of consumers accesses those allocations. This type of variable execution sequence means that consumers cannot free the allocations because a subsequent consumer may require access. Auto free on launch means that the launch loop does not need to track the producer’s allocations - instead, that information remains isolated to the producer’s creation and destruction logic. In general, auto free on launch simplifies an algorithm which would otherwise need to free all the allocations owned by a graph before each relaunch.

Note

The cudaGraphInstantiateFlagAutoFreeOnLaunch flag does not change the behavior of graph destruction. The application must explicitly free the unfreed memory in order to avoid memory leaks, even for graphs instantiated with the flag. The following code shows the use of cudaGraphInstantiateFlagAutoFreeOnLaunch to simplify a single-producer / multiple-consumer algorithm:

// Create producer graph which allocates memory and populates it with data
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&data1, blocks * threads, cudaStreamPerThread);
cudaMallocAsync(&data2, blocks * threads, cudaStreamPerThread);
produce<<<blocks, threads, 0, cudaStreamPerThread>>>(data1, data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&producer,
                              graph,
                              cudaGraphInstantiateFlagAutoFreeOnLaunch);
cudaGraphDestroy(graph);

// Create first consumer graph by capturing an asynchronous library call
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consumerFromLibrary(data1, cudaStreamPerThread);
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer1, graph, 0); //regular instantiation
cudaGraphDestroy(graph);

// Create second consumer graph
cudaStreamBeginCapture(cudaStreamPerThread, cudaStreamCaptureModeGlobal);
consume2<<<blocks, threads, 0, cudaStreamPerThread>>>(data2);
...
cudaStreamEndCapture(cudaStreamPerThread, &graph);
cudaGraphInstantiateWithFlags(&consumer2, graph, 0);
cudaGraphDestroy(graph);

// Launch in a loop
bool launchConsumer2 = false;
do {
    cudaGraphLaunch(producer, myStream);
    cudaGraphLaunch(consumer1, myStream);
    if (launchConsumer2) {
        cudaGraphLaunch(consumer2, myStream);
    }
} while (determineAction(&launchConsumer2));

cudaFreeAsync(data1, myStream);
cudaFreeAsync(data2, myStream);

cudaGraphExecDestroy(producer);
cudaGraphExecDestroy(consumer1);
cudaGraphExecDestroy(consumer2);

12.4. Optimized Memory Reuse

CUDA reuses memory in two ways:

  • Virtual and physical memory reuse within a graph is based on virtual address assignment, like in the stream ordered allocator.

  • Physical memory reuse between graphs is done with virtual aliasing: different graphs can map the same physical memory to their unique virtual addresses.

12.4.1. Address Reuse within a Graph

CUDA may reuse memory within a graph by assigning the same virtual address ranges to different allocations whose lifetimes do not overlap. Since virtual addresses may be reused, pointers to different allocations with disjoint lifetimes are not guaranteed to be unique.

The following figure shows adding a new allocation node (2) that can reuse the address freed by a dependent node (1).


Figure 29 Adding New Alloc Node 2

The following figure shows adding a new alloc node (3). The new alloc node is not dependent on the free node (2), so it cannot reuse the address from the associated alloc node (2). If alloc node (2) reused the address freed by free node (1), the new alloc node (3) would need a new address.

Adding New Alloc Node 3

Figure 30 Adding New Alloc Node 3
图 30 添加新的 Alloc 节点 3 

12.4.2. Physical Memory Management and Sharing
12.4.2. 物理内存管理和共享 

CUDA is responsible for mapping physical memory to the virtual address before the allocating node is reached in GPU order. As an optimization for memory footprint and mapping overhead, multiple graphs may use the same physical memory for distinct allocations if they will not run simultaneously; however, physical pages cannot be reused if they are bound to more than one executing graph at the same time, or to a graph allocation which remains unfreed.
CUDA 负责在 GPU 顺序中到达分配节点之前将物理内存映射到虚拟地址。为了优化内存占用和映射开销,如果多个图形不会同时运行,则多个图形可以使用相同的物理内存进行不同的分配;但是,如果物理页面同时绑定到多个正在执行的图形,或者绑定到仍未释放的图形分配,那么物理页面就不能被重用。

CUDA may update physical memory mappings at any time during graph instantiation, launch, or execution. CUDA may also introduce synchronization between future graph launches in order to prevent live graph allocations from referring to the same physical memory. As for any allocate-free-allocate pattern, if a program accesses a pointer outside of an allocation’s lifetime, the erroneous access may silently read or write live data owned by another allocation (even if the virtual address of the allocation is unique). Use of compute sanitizer tools can catch this error.
CUDA 可能在图实例化、启动或执行过程中的任何时间更新物理内存映射。CUDA 也可能在未来图启动之间引入同步,以防止活动图分配引用相同的物理内存。对于任何分配-释放-分配模式,如果程序访问超出分配生命周期的指针,错误访问可能会悄悄读取或写入另一个分配拥有的活动数据(即使分配的虚拟地址是唯一的)。使用计算检查器工具可以捕获此错误。

The following figure shows graphs sequentially launched in the same stream. In this example, each graph frees all the memory it allocates. Since the graphs in the same stream never run concurrently, CUDA can and should use the same physical memory to satisfy all the allocations.
下图显示了在同一流中顺序启动的图。在这个示例中,每个图都会释放它分配的所有内存。由于同一流中的图永远不会并发运行,CUDA 可以并且应该使用相同的物理内存来满足所有分配。

Sequentially Launched Graphs

Figure 31 Sequentially Launched Graphs
图 31 顺序启动的图形 

12.5. Performance Considerations
12.5. 性能考虑 

When multiple graphs are launched into the same stream, CUDA attempts to allocate the same physical memory to them because the execution of these graphs cannot overlap. Physical mappings for a graph are retained between launches as an optimization to avoid the cost of remapping. If, at a later time, one of the graphs is launched such that its execution may overlap with the others (for example if it is launched into a different stream) then CUDA must perform some remapping because concurrent graphs require distinct memory to avoid data corruption.
当多个图形被启动到同一流中时,CUDA 会尝试为它们分配相同的物理内存,因为这些图形的执行不能重叠。图形的物理映射在启动之间保留,以避免重新映射的成本。如果在以后的某个时间,其中一个图形被启动,以便其执行可能与其他图形重叠(例如,如果它被启动到不同的流中),那么 CUDA 必须执行一些重新映射,因为并发图形需要不同的内存以避免数据损坏。

In general, remapping of graph memory in CUDA is likely caused by these operations:
通常,CUDA 中图形内存重映可能是由以下操作引起的:

  • Changing the stream into which a graph is launched
    更改启动图形的流

  • A trim operation on the graph memory pool, which explicitly frees unused memory (discussed in Physical Memory Footprint)
    图形内存池上的修剪操作,显式释放未使用的内存(在物理内存占用中讨论)

  • Relaunching a graph while an unfreed allocation from another graph is mapped to the same memory will cause a remap of memory before relaunch
    当在将另一个图的未释放分配映射到相同内存时重新启动图,将导致在重新启动之前重新映射内存

Remapping must happen in execution order, but after any previous execution of that graph is complete (otherwise memory that is still in use could be unmapped). Due to this ordering dependency, as well as because mapping operations are OS calls, mapping operations can be relatively expensive. Applications can avoid this cost by launching graphs containing allocation memory nodes consistently into the same stream.
重新映射必须按执行顺序进行,但必须在图的任何先前执行完成后进行(否则仍在使用的内存可能会被取消映射)。由于这种顺序依赖关系,以及映射操作是操作系统调用,映射操作可能相对昂贵。应用程序可以通过将包含分配内存节点的图一致地启动到相同的流中来避免这种成本。

12.5.1. First Launch / cudaGraphUpload
12.5.1. 首次启动 / cudaGraphUpload 

Physical memory cannot be allocated or mapped during graph instantiation because the stream in which the graph will execute is unknown. Mapping is done instead during graph launch. Calling cudaGraphUpload can separate out the cost of allocation from the launch by performing all mappings for that graph immediately and associating the graph with the upload stream. If the graph is then launched into the same stream, it will launch without any additional remapping.
在图实例化期间无法分配或映射物理内存,因为执行图的流是未知的。映射是在图启动期间完成的。通过立即为该图执行所有映射并将图与上传流关联,调用 cudaGraphUpload 可以将分配成本与启动分开。然后,如果将图启动到相同的流中,它将在无需额外重新映射的情况下启动。
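For example (a sketch, assuming `graph` and `stream` already exist):

```cuda
cudaGraphExec_t graphExec;
cudaGraphInstantiate(&graphExec, graph, 0); // no physical memory mapped yet
cudaGraphUpload(graphExec, stream);         // maps allocations, binds to stream
cudaGraphLaunch(graphExec, stream);         // same stream: launches without remapping
```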

Using different streams for graph upload and graph launch behaves similarly to switching streams, likely resulting in remap operations. In addition, unrelated memory pool management is permitted to pull memory from an idle stream, which could negate the impact of the uploads.
使用不同的流来上传图形和启动图形的行为类似于切换流,可能导致重新映射操作。此外,允许从空闲流中提取内存的不相关内存池管理可能抵消上传的影响。

12.6. Physical Memory Footprint
12.6. 物理内存占用

The pool-management behavior of asynchronous allocation means that destroying a graph which contains memory nodes (even if their allocations are free) will not immediately return physical memory to the OS for use by other processes. To explicitly release memory back to the OS, an application should use the cudaDeviceGraphMemTrim API.
异步分配的池管理行为意味着,销毁包含内存节点的图(即使其中的分配已被释放)不会立即将物理内存返回给操作系统以供其他进程使用。要显式地将内存释放回操作系统,应用程序应使用 cudaDeviceGraphMemTrim API。

cudaDeviceGraphMemTrim will unmap and release any physical memory reserved by graph memory nodes that is not actively in use. Allocations that have not been freed and graphs that are scheduled or running are considered to be actively using the physical memory and will not be impacted. Use of the trim API will make physical memory available to other allocation APIs and other applications or processes, but will cause CUDA to reallocate and remap memory when the trimmed graphs are next launched. Note that cudaDeviceGraphMemTrim operates on a different pool from cudaMemPoolTrimTo(): the graph memory pool is not exposed to the stream ordered memory allocator.
cudaDeviceGraphMemTrim 将取消映射并释放图内存节点保留的、未被积极使用的物理内存。未被释放的分配以及已排队或正在运行的图被视为正在积极使用物理内存,不会受到影响。使用 trim API 将使物理内存可供其他分配 API 以及其他应用程序或进程使用,但会导致 CUDA 在下次启动被修剪的图时重新分配和重新映射内存。请注意, cudaDeviceGraphMemTrimcudaMemPoolTrimTo() 操作的是不同的池:图内存池不暴露给流有序内存分配器。

CUDA allows applications to query their graph memory footprint through the cudaDeviceGetGraphMemAttribute API. Querying the attribute cudaGraphMemAttrReservedMemCurrent returns the amount of physical memory reserved by the driver for graph allocations in the current process. Querying cudaGraphMemAttrUsedMemCurrent returns the amount of physical memory currently mapped by at least one graph. Either attribute can be used to track when new physical memory is acquired by CUDA for an allocating graph, and both are useful for examining how much memory is saved by the sharing mechanism.
CUDA 允许应用程序通过 cudaDeviceGetGraphMemAttribute API 查询其图内存占用。查询属性 cudaGraphMemAttrReservedMemCurrent 返回驱动程序为当前进程的图分配保留的物理内存量。查询 cudaGraphMemAttrUsedMemCurrent 返回当前至少被一个图映射的物理内存量。这两个属性都可用于跟踪 CUDA 何时为分配图获取新的物理内存,也都可用于检查共享机制节省了多少内存。
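A short sketch of querying the footprint and then trimming (device 0 assumed):

```cuda
cuuint64_t reserved = 0, used = 0;
cudaDeviceGetGraphMemAttribute(0, cudaGraphMemAttrReservedMemCurrent, &reserved);
cudaDeviceGetGraphMemAttribute(0, cudaGraphMemAttrUsedMemCurrent, &used);
printf("reserved %llu bytes, %llu bytes mapped by live graphs\n",
       (unsigned long long)reserved, (unsigned long long)used);

// Release unused reserved memory back to the OS. Trimmed graphs will
// reallocate and remap on their next launch.
cudaDeviceGraphMemTrim(0);
```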

12.7. Peer Access
12.7. 对等访问 

Graph allocations can be configured for access from multiple GPUs, in which case CUDA will map the allocations onto the peer GPUs as required. CUDA allows graph allocations requiring different mappings to reuse the same virtual address. When this occurs, the address range is mapped onto all GPUs required by the different allocations. This means an allocation may sometimes allow more peer access than was requested during its creation; however, relying on these extra mappings is still an error.
图形分配可以配置为从多个 GPU 访问,此时 CUDA 将根据需要将分配映射到对等 GPU 上。CUDA 允许需要不同映射的图形分配重用相同的虚拟地址。当这种情况发生时,地址范围将映射到不同分配所需的所有 GPU 上。这意味着分配有时可能允许比其创建过程中请求的对等访问更多;但是,依赖这些额外的映射仍然是一个错误。

12.7.1. Peer Access with Graph Node APIs
12.7.1. 使用 Graph Node APIs 进行对等访问 

The cudaGraphAddMemAllocNode API accepts mapping requests in the accessDescs array field of the node parameters structure. The poolProps.location embedded structure specifies the resident device for the allocation. Access from the allocating GPU is assumed to be needed, thus the application does not need to specify an entry for the resident device in the accessDescs array.
cudaGraphAddMemAllocNode API 接受节点参数结构的 accessDescs 数组字段中的映射请求。 poolProps.location 嵌入结构指定了分配的驻留设备。假定需要从分配的 GPU 访问,因此应用程序不需要在 accessDescs 数组中为驻留设备指定条目。

cudaMemAllocNodeParams params = {};
params.poolProps.allocType = cudaMemAllocationTypePinned;
params.poolProps.location.type = cudaMemLocationTypeDevice;
// specify device 1 as the resident device
params.poolProps.location.id = 1;
params.bytesize = size;

// allocate an allocation resident on device 1 accessible from device 1
cudaGraphAddMemAllocNode(&allocNode, graph, NULL, 0, &params);

cudaMemAccessDesc accessDescs[2];
// boilerplate for the access descs (only ReadWrite and Device access supported by the add node api)
accessDescs[0].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[0].location.type = cudaMemLocationTypeDevice;
accessDescs[1].flags = cudaMemAccessFlagsProtReadWrite;
accessDescs[1].location.type = cudaMemLocationTypeDevice;

// access being requested for device 0 & 2.  Device 1 access requirement left implicit.
accessDescs[0].location.id = 0;
accessDescs[1].location.id = 2;

// access request array has 2 entries.
params.accessDescCount = 2;
params.accessDescs = accessDescs;

// allocate an allocation resident on device 1 accessible from devices 0, 1 and 2. (0 & 2 from the descriptors, 1 from it being the resident device).
cudaGraphAddMemAllocNode(&allocNode, graph, NULL, 0, &params);

12.7.2. Peer Access with Stream Capture
12.7.2. 使用流捕获进行对等访问 

For stream capture, the allocation node records the peer accessibility of the allocating pool at the time of the capture. Altering the peer accessibility of the allocating pool after a cudaMallocFromPoolAsync call is captured does not affect the mappings that the graph will make for the allocation.
对于流捕获,分配节点记录在捕获时分配池的对等可访问性。在捕获 cudaMallocFromPoolAsync 调用后更改分配池的对等可访问性不会影响图形为分配所做的映射。

cudaMemAccessDesc accessDesc = {};
// boilerplate for the access desc (only ReadWrite and Device access supported)
accessDesc.flags = cudaMemAccessFlagsProtReadWrite;
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id = 1;

// let memPool be resident and accessible on device 0

cudaStreamBeginCapture(stream);
cudaMallocAsync(&dptr1, size, memPool, stream);
cudaStreamEndCapture(stream, &graph1);

cudaMemPoolSetAccess(memPool, &accessDesc, 1);

cudaStreamBeginCapture(stream);
cudaMallocAsync(&dptr2, size, memPool, stream);
cudaStreamEndCapture(stream, &graph2);

//The graph node allocating dptr1 would only have the device 0 accessibility even though memPool now has device 1 accessibility.
//The graph node allocating dptr2 will have device 0 and device 1 accessibility, since that was the pool accessibility at the time of the cudaMallocAsync call.

13. Mathematical Functions
13. 数学函数 

The reference manual lists, along with their description, all the functions of the C/C++ standard library mathematical functions that are supported in device code, as well as all intrinsic functions (that are only supported in device code).
参考手册列出了 C/C++标准库数学函数的所有函数,以及它们的描述,这些函数在设备代码中受支持,以及所有内部函数(仅在设备代码中受支持)。

This section provides accuracy information for some of these functions when applicable. It uses ULP for quantification. For further information on the definition of the Unit in the Last Place (ULP), please see Jean-Michel Muller’s paper On the definition of ulp(x), RR-5504, LIP RR-2005-09, INRIA, LIP. 2005, pp.16 at https://hal.inria.fr/inria-00070503/document.
本节提供了一些函数的准确性信息(适用时)。它使用 ULP 进行量化。有关最后一位单位(ULP)定义的更多信息,请参阅 Jean-Michel Muller 的论文《关于 ulp(x)的定义》,RR-5504,LIP RR-2005-09,INRIA,LIP。2005 年,第 16 页,网址:https://hal.inria.fr/inria-00070503/document。
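As a concrete illustration of the metric (not part of any CUDA library), the ulp distance between a computed result and the exact value can be measured in plain Python:

```python
import math
from fractions import Fraction

def ulp_error(computed: float, exact: Fraction) -> float:
    """Distance between a computed float and the exact result,
    measured in ulps of the computed value."""
    return float(abs(Fraction(computed) - exact) / Fraction(math.ulp(computed)))

# A correctly rounded operation (like IEEE-754 addition) stays within 0.5 ulp.
x, y = 1.0, 1e-17
err = ulp_error(x + y, Fraction(1) + Fraction(1e-17))
print(err)  # small: well under 0.5
```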

Mathematical functions supported in device code do not set the global errno variable, nor report any floating-point exceptions to indicate errors; thus, if error diagnostic mechanisms are required, the user should implement additional screening for inputs and outputs of the functions. The user is responsible for the validity of pointer arguments. The user must not pass uninitialized parameters to the Mathematical functions as this may result in undefined behavior: functions are inlined in the user program and thus are subject to compiler optimizations.
设备代码中支持的数学函数不会设置全局 errno 变量,也不会报告任何浮点异常以指示错误;因此,如果需要错误诊断机制,则用户应为函数的输入和输出实现额外的筛选。用户对指针参数的有效性负责。用户不得将未初始化的参数传递给数学函数,因为这可能导致未定义的行为:函数会内联到用户程序中,因此会受到编译器优化的影响。

13.1. Standard Functions
13.1. 标准函数 

The functions from this section can be used in both host and device code.
本节中的函数可在主机代码和设备代码中使用。

This section specifies the error bounds of each function when executed on the device and also when executed on the host in the case where the host does not supply the function.
本节指定了在设备上执行时每个函数的误差界限,以及在主机上执行时的误差界限,假设主机未提供函数的情况。

The error bounds are generated from extensive but not exhaustive tests, so they are not guaranteed bounds.
错误边界是从广泛但不是详尽的测试中生成的,因此它们不是保证的边界。

Single-Precision Floating-Point Functions
单精度浮点函数

Addition and multiplication are IEEE-compliant, so have a maximum error of 0.5 ulp.
加法和乘法符合 IEEE 标准,因此最大误差为 0.5 ulp。

The recommended way to round a single-precision floating-point operand to an integer, with the result being a single-precision floating-point number is rintf(), not roundf(). The reason is that roundf() maps to a 4-instruction sequence on the device, whereas rintf() maps to a single instruction. truncf(), ceilf(), and floorf() each map to a single instruction as well.
将单精度浮点操作数舍入为整数(结果为单精度浮点数)的推荐方法是 rintf() 而不是 roundf() 。原因是 roundf() 在设备上映射为 4 条指令序列,而 rintf() 映射为单条指令。 truncf()ceilf()floorf() 各自也映射为单条指令。
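The two functions also differ on halfway cases: roundf() rounds ties away from zero, while rintf() follows the default round-to-nearest-even mode. A small Python sketch of the two rules (the x + 0.5 trick only illustrates the rule; it is not how roundf() is implemented):

```python
import math

def roundf_like(x: float) -> float:
    # C round()/roundf(): halfway cases round away from zero.
    return float(math.floor(x + 0.5)) if x >= 0 else float(math.ceil(x - 0.5))

def rintf_like(x: float) -> float:
    # rint()/rintf() in the default mode: halfway cases round to even.
    # Python's built-in round() uses the same ties-to-even rule.
    return float(round(x))

for x in (2.5, 3.5, -2.5):
    print(x, roundf_like(x), rintf_like(x))
# 2.5 rounds to 3.0 away from zero, but to 2.0 under ties-to-even
```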

Table 13 Single-Precision Mathematical Standard Library Functions with Maximum ULP Error. The maximum error is stated as the absolute value of the difference in ulps between the result returned by the CUDA library function and a correctly rounded single-precision result obtained according to the round-to-nearest ties-to-even rounding mode.
表 13 单精度数学标准库函数的最大 ULP 误差。最大误差被说明为 CUDA 库函数返回的结果与根据最接近的舍入模式获得的正确舍入的单精度结果之间的 ulp 差的绝对值。 

Function 功能

Maximum ulp error 最大 ulp 误差

x+y

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

x*y

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

x/y

0 for compute capability ≥ 2 when compiled with -prec-div=true
当使用 -prec-div=true 编译时,对于计算能力 ≥ 2 为 0

2 (full range), otherwise
2(完整范围),否则

1/x

0 for compute capability ≥ 2 when compiled with -prec-div=true
当使用 -prec-div=true 编译时,对于计算能力 ≥ 2 为 0

1 (full range), otherwise
1(完整范围),否则

rsqrtf(x)

1/sqrtf(x)

2 (full range) 2(全范围)

Applies to 1/sqrtf(x) only when it is converted to rsqrtf(x) by the compiler.
仅当 1/sqrtf(x) 被编译器转换为 rsqrtf(x) 时适用。

sqrtf(x)

0 when compiled with -prec-sqrt=true
当与 -prec-sqrt=true 一起编译时为 0

Otherwise 1 for compute capability ≥ 5.2
否则,对于计算能力 ≥ 5.2 为 1

and 3 for older architectures
对于较旧的架构为 3

cbrtf(x)

1 (full range) 1(全范围)

rcbrtf(x)

1 (full range) 1(全范围)

hypotf(x,y)

3 (full range) 3(全范围)

rhypotf(x,y)

2 (full range) 2(全范围)

norm3df(x,y,z)

3 (full range) 3(全范围)

rnorm3df(x,y,z)

2 (full range) 2(全范围)

norm4df(x,y,z,t)

3 (full range) 3(全范围)

rnorm4df(x,y,z,t)

2 (full range) 2(全范围)

normf(dim,arr)

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off.
由于使用了快速算法并存在舍入带来的精度损失,无法提供误差界限。

rnormf(dim,arr)

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off.
由于使用了快速算法并存在舍入带来的精度损失,无法提供误差界限。

expf(x)

2 (full range) 2(全范围)

exp2f(x)

2 (full range) 2(全范围)

exp10f(x)

2 (full range) 2(全范围)

expm1f(x)

1 (full range) 1(全范围)

logf(x)

1 (full range) 1(全范围)

log2f(x)

1 (full range) 1(全范围)

log10f(x)

2 (full range) 2(全范围)

log1pf(x)

1 (full range) 1(全范围)

sinf(x)

2 (full range) 2(全范围)

cosf(x)

2 (full range) 2(全范围)

tanf(x)

4 (full range) 4(全范围)

sincosf(x,sptr,cptr)

2 (full range) 2(全范围)

sinpif(x)

1 (full range) 1(全范围)

cospif(x)

1 (full range) 1(全范围)

sincospif(x,sptr,cptr)

1 (full range) 1(全范围)

asinf(x)

2 (full range) 2(全范围)

acosf(x)

2 (full range) 2(全范围)

atanf(x)

2 (full range) 2(全范围)

atan2f(y,x)

3 (full range) 3(全范围)

sinhf(x)

3 (full range) 3(全范围)

coshf(x)

2 (full range) 2(全范围)

tanhf(x)

2 (full range) 2(全范围)

asinhf(x)

3 (full range) 3(全范围)

acoshf(x)

4 (full range) 4(全范围)

atanhf(x)

3 (full range) 3(全范围)

powf(x,y)

4 (full range) 4(全范围)

erff(x)

2 (full range) 2(全范围)

erfcf(x)

4 (full range) 4(全范围)

erfinvf(x)

2 (full range) 2(全范围)

erfcinvf(x)

4 (full range) 4(全范围)

erfcxf(x)

4 (full range) 4(全范围)

normcdff(x)

5 (full range) 5(全范围)

normcdfinvf(x)

5 (full range) 5(全范围)

lgammaf(x)

6 (outside interval -10.001 … -2.264; larger inside)
6(在区间-10.001 ... -2.264 之外;内部较大)

tgammaf(x)

5 (full range) 5(全范围)

fmaf(x,y,z)

0 (full range) 0(完整范围)

frexpf(x,exp)

0 (full range) 0(完整范围)

ldexpf(x,exp)

0 (full range) 0(完整范围)

scalbnf(x,n)

0 (full range) 0(完整范围)

scalblnf(x,l)

0 (full range) 0(完整范围)

logbf(x)

0 (full range) 0(完整范围)

ilogbf(x)

0 (full range) 0(完整范围)

j0f(x)

9 for |x| < 8
9 对于|x| < 8

otherwise, the maximum absolute error is 2.2 x 10^-6
否则,最大绝对误差为 2.2 x 10^-6

j1f(x)

9 for |x| < 8
9 对于|x| < 8

otherwise, the maximum absolute error is 2.2 x 10^-6
否则,最大绝对误差为 2.2 x 10^-6

jnf(n,x)

For n = 128, the maximum absolute error is 2.2 x 10^-6
对于 n = 128,最大绝对误差为 2.2 x 10^-6

y0f(x)

9 for |x| < 8
9 对于|x| < 8

otherwise, the maximum absolute error is 2.2 x 10^-6
否则,最大绝对误差为 2.2 x 10^-6

y1f(x)

9 for |x| < 8
9 对于|x| < 8

otherwise, the maximum absolute error is 2.2 x 10^-6
否则,最大绝对误差为 2.2 x 10^-6

ynf(n,x)

ceil(2 + 2.5n) for |x| < n
ceil(2 + 2.5n) 对于 |x| < n

otherwise, the maximum absolute error is 2.2 x 10^-6
否则,最大绝对误差为 2.2 x 10^-6

cyl_bessel_i0f(x)

6 (full range) 6(全范围)

cyl_bessel_i1f(x)

6 (full range) 6(全范围)

fmodf(x,y)

0 (full range) 0(完整范围)

remainderf(x,y)

0 (full range) 0(完整范围)

remquof(x,y,iptr)

0 (full range) 0(完整范围)

modff(x,iptr)

0 (full range) 0(完整范围)

fdimf(x,y)

0 (full range) 0(完整范围)

truncf(x)

0 (full range) 0(完整范围)

roundf(x)

0 (full range) 0(完整范围)

rintf(x)

0 (full range) 0(完整范围)

nearbyintf(x)

0 (full range) 0(完整范围)

ceilf(x)

0 (full range) 0(完整范围)

floorf(x)

0 (full range) 0(完整范围)

lrintf(x)

0 (full range) 0(完整范围)

lroundf(x)

0 (full range) 0(完整范围)

llrintf(x)

0 (full range) 0(完整范围)

llroundf(x)

0 (full range) 0(完整范围)

Double-Precision Floating-Point Functions
双精度浮点函数

The recommended way to round a double-precision floating-point operand to an integer, with the result being a double-precision floating-point number is rint(), not round(). The reason is that round() maps to a 5-instruction sequence on the device, whereas rint() maps to a single instruction. trunc(), ceil(), and floor() each map to a single instruction as well.
将双精度浮点操作数舍入为整数(结果为双精度浮点数)的推荐方法是 rint() 而不是 round() 。原因是 round() 在设备上映射为 5 条指令序列,而 rint() 映射为单条指令。 trunc()ceil()floor() 各自也映射为单条指令。

Table 14 Double-Precision Mathematical Standard Library Functions with Maximum ULP Error. The maximum error is stated as the absolute value of the difference in ulps between the result returned by the CUDA library function and a correctly rounded double-precision result obtained according to the round-to-nearest ties-to-even rounding mode.
表 14 双精度数学标准库函数的最大 ULP 误差。最大误差被规定为 CUDA 库函数返回的结果与根据最接近的偶数舍入模式获得的正确舍入的双精度结果之间的 ULP 差的绝对值。 

Function 功能

Maximum ulp error 最大 ulp 误差

x+y

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

x*y

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

x/y

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

1/x

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

sqrt(x)

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

rsqrt(x)

1 (full range) 1(全范围)

cbrt(x)

1 (full range) 1(全范围)

rcbrt(x)

1 (full range) 1(全范围)

hypot(x,y)

2 (full range) 2(全范围)

rhypot(x,y)

1 (full range) 1(全范围)

norm3d(x,y,z)

2 (full range) 2(全范围)

rnorm3d(x,y,z)

1 (full range) 1(全范围)

norm4d(x,y,z,t)

2 (full range) 2(全范围)

rnorm4d(x,y,z,t)

1 (full range) 1(全范围)

norm(dim,arr)

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off.
由于四舍五入导致精度损失,无法提供错误边界,因为使用了快速算法。

rnorm(dim,arr)

An error bound cannot be provided because a fast algorithm is used with accuracy loss due to round-off.
由于四舍五入导致精度损失,无法提供错误边界,因为使用了快速算法。

exp(x)

1 (full range) 1(全范围)

exp2(x)

1 (full range) 1(全范围)

exp10(x)

1 (full range) 1(全范围)

expm1(x)

1 (full range) 1(全范围)

log(x)

1 (full range) 1(全范围)

log2(x)

1 (full range) 1(全范围)

log10(x)

1 (full range) 1(全范围)

log1p(x)

1 (full range) 1(全范围)

sin(x)

2 (full range) 2(全范围)

cos(x)

2 (full range) 2(全范围)

tan(x)

2 (full range) 2(全范围)

sincos(x,sptr,cptr)

2 (full range) 2(全范围)

sinpi(x)

2 (full range) 2(全范围)

cospi(x)

2 (full range) 2(全范围)

sincospi(x,sptr,cptr)

2 (full range) 2(全范围)

asin(x)

2 (full range) 2(全范围)

acos(x)

2 (full range) 2(全范围)

atan(x)

2 (full range) 2(全范围)

atan2(y,x)

2 (full range) 2(全范围)

sinh(x)

2 (full range) 2(全范围)

cosh(x)

1 (full range) 1(全范围)

tanh(x)

1 (full range) 1(全范围)

asinh(x)

3 (full range) 3(全范围)

acosh(x)

3 (full range) 3(全范围)

atanh(x)

2 (full range) 2(全范围)

pow(x,y)

2 (full range) 2(全范围)

erf(x)

2 (full range) 2(全范围)

erfc(x)

5 (full range) 5(全范围)

erfinv(x)

5 (full range) 5(全范围)

erfcinv(x)

6 (full range) 6(全范围)

erfcx(x)

4 (full range) 4(全范围)

normcdf(x)

5 (full range) 5(全范围)

normcdfinv(x)

8 (full range) 8(全范围)

lgamma(x)

4 (outside interval -23.0001 … -2.2637; larger inside)
4(在区间-23.0001 ... -2.2637 之外;内部更大)

tgamma(x)

10 (full range) 10(全范围)

fma(x,y,z)

0 (IEEE-754 round-to-nearest-even)
0(IEEE-754 四舍五入到最接近的偶数)

frexp(x,exp)

0 (full range) 0(完整范围)

ldexp(x,exp)

0 (full range) 0(完整范围)

scalbn(x,n)

0 (full range) 0(完整范围)

scalbln(x,l)

0 (full range) 0(完整范围)

logb(x)

0 (full range) 0(完整范围)

ilogb(x)

0 (full range) 0(完整范围)

j0(x)

7 for |x| < 8
7 对于|x| < 8

otherwise, the maximum absolute error is 5 x 10^-12
否则,最大绝对误差为 5 x 10^-12

j1(x)

7 for |x| < 8
7 对于|x| < 8

otherwise, the maximum absolute error is 5 x 10^-12
否则,最大绝对误差为 5 x 10^-12

jn(n,x)

For n = 128, the maximum absolute error is 5 x 10^-12
对于 n = 128,最大绝对误差为 5 x 10^-12

y0(x)

7 for |x| < 8
7 对于|x| < 8

otherwise, the maximum absolute error is 5 x 10^-12
否则,最大绝对误差为 5 x 10^-12

y1(x)

7 for |x| < 8
7 对于|x| < 8

otherwise, the maximum absolute error is 5 x 10^-12
否则,最大绝对误差为 5 x 10^-12

yn(n,x)

For |x| > 1.5n, the maximum absolute error is 5 x 10^-12
对于 |x| > 1.5n,最大绝对误差为 5 x 10^-12

cyl_bessel_i0(x)

6 (full range) 6(全范围)

cyl_bessel_i1(x)

6 (full range) 6(全范围)

fmod(x,y)

0 (full range) 0(完整范围)

remainder(x,y)

0 (full range) 0(完整范围)

remquo(x,y,iptr)

0 (full range) 0(完整范围)

modf(x,iptr)

0 (full range) 0(完整范围)

fdim(x,y)

0 (full range) 0(完整范围)

trunc(x)

0 (full range) 0(完整范围)

round(x)

0 (full range) 0(完整范围)

rint(x)

0 (full range) 0(完整范围)

nearbyint(x)

0 (full range) 0(完整范围)

ceil(x)

0 (full range) 0(完整范围)

floor(x)

0 (full range) 0(完整范围)

lrint(x)

0 (full range) 0(完整范围)

lround(x)

0 (full range) 0(完整范围)

llrint(x)

0 (full range) 0(完整范围)

llround(x)

0 (full range) 0(完整范围)

13.2. Intrinsic Functions
13.2. 内置函数 

The functions from this section can only be used in device code.
此部分的函数只能在设备代码中使用。

Among these functions are the less accurate, but faster versions of some of the functions of Standard Functions. They have the same name prefixed with __ (such as __sinf(x)). They are faster as they map to fewer native instructions. The compiler has an option (-use_fast_math) that forces each function in Table 15 to compile to its intrinsic counterpart. In addition to reducing the accuracy of the affected functions, it may also cause some differences in special case handling. A more robust approach is to selectively replace mathematical function calls by calls to intrinsic functions only where it is merited by the performance gains and where changed properties such as reduced accuracy and different special case handling can be tolerated.
在这些功能中,有一些不太准确但更快的标准函数的版本。它们的名称相同,前缀为 __ (例如 __sinf(x) )。它们更快,因为它们映射到更少的本机指令。编译器有一个选项( -use_fast_math ),可以强制表 15 中的每个函数编译为其固有对应项。除了降低受影响函数的准确性外,还可能导致一些特殊情况处理上的差异。更健壮的方法是有选择性地将数学函数调用替换为仅在性能收益值得的情况下调用固有函数,并且可以容忍降低准确性和不同特殊情况处理等更改属性。
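For example, a kernel can opt into a single intrinsic where the accuracy loss is acceptable, leaving every other math call at standard accuracy (a sketch; the kernel and its workload are illustrative):

```cuda
__global__ void attenuate(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __expf trades ulp accuracy for fewer native instructions;
        // the rest of the program keeps the standard expf accuracy.
        data[i] *= __expf(-0.5f * data[i]);
    }
}
```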

Table 15 Functions Affected by -use_fast_math
表 15 使用 -use_fast_math 影响的函数 

Operator/Function 运算符/函数

Device Function 设备功能

x/y

__fdividef(x,y)

sinf(x)

__sinf(x)

cosf(x)

__cosf(x)

tanf(x)

__tanf(x)

sincosf(x,sptr,cptr)

__sincosf(x,sptr,cptr)

logf(x)

__logf(x)

log2f(x)

__log2f(x)

log10f(x)

__log10f(x)

expf(x)

__expf(x)

exp10f(x)

__exp10f(x)

powf(x,y)

__powf(x,y)

Single-Precision Floating-Point Functions
单精度浮点函数

__fadd_[rn,rz,ru,rd]() and __fmul_[rn,rz,ru,rd]() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the ‘*’ and ‘+’ operators will frequently be combined into FMADs.
__fadd_[rn,rz,ru,rd]()__fmul_[rn,rz,ru,rd]() 分别映射到编译器不会合并为 FMADs 的加法和乘法操作。相比之下,从 '*' 和 '+' 运算符生成的加法和乘法将经常被合并为 FMADs。

Functions suffixed with _rn operate using the round to nearest even rounding mode.
带有 _rn 后缀的函数使用最接近的偶数舍入模式进行操作。

Functions suffixed with _rz operate using the round towards zero rounding mode.
带有 _rz 后缀的函数使用朝零舍入模式运行。

Functions suffixed with _ru operate using the round up (to positive infinity) rounding mode.
带有 _ru 后缀的函数使用向上取整(到正无穷大)的舍入模式运行。

Functions suffixed with _rd operate using the round down (to negative infinity) rounding mode.
带有 _rd 后缀的函数使用向下取整(向负无穷大取整)模式运行。

The accuracy of floating-point division varies depending on whether the code is compiled with -prec-div=false or -prec-div=true. When the code is compiled with -prec-div=false, both the regular division / operator and __fdividef(x,y) have the same accuracy, but for 2^126 < |y| < 2^128, __fdividef(x,y) delivers a result of zero, whereas the / operator delivers the correct result to within the accuracy stated in Table 16. Also, for 2^126 < |y| < 2^128, if x is infinity, __fdividef(x,y) delivers a NaN (as a result of multiplying infinity by zero), while the / operator returns infinity. On the other hand, the / operator is IEEE-compliant when the code is compiled with -prec-div=true or without any -prec-div option at all since its default value is true.
浮点除法的准确性取决于代码是使用 -prec-div=false 还是 -prec-div=true 编译。当代码使用 -prec-div=false 编译时,常规除法 / 运算符和 __fdividef(x,y) 具有相同的准确性,但对于 2^126 < |y| < 2^128, __fdividef(x,y) 会返回零,而 / 运算符会返回表 16 中所述准确性范围内的正确结果。此外,对于 2^126 < |y| < 2^128,如果 x 是无穷大, __fdividef(x,y) 会返回 NaN(因为无穷大乘以零),而 / 运算符会返回无穷大。另一方面,当代码使用 -prec-div=true 编译或根本没有 -prec-div 选项时, / 运算符符合 IEEE 标准,因为其默认值为 true。

Table 16 Single-Precision Floating-Point Intrinsic Functions. (Supported by the CUDA Runtime Library with Respective Error Bounds)
表 16 单精度浮点内置函数。 (由 CUDA 运行时库支持,具有相应的误差界限) 

Function 功能

Error bounds 误差界限

__fadd_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__fsub_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__fmul_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__fmaf_[rn,rz,ru,rd](x,y,z)

IEEE-compliant. 符合 IEEE 标准。

__frcp_[rn,rz,ru,rd](x)

IEEE-compliant. 符合 IEEE 标准。

__fsqrt_[rn,rz,ru,rd](x)

IEEE-compliant. 符合 IEEE 标准。

__frsqrt_rn(x)

IEEE-compliant. 符合 IEEE 标准。

__fdiv_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__fdividef(x,y)

For |y| in [2^-126, 2^126], the maximum ulp error is 2.
对于 [2^-126, 2^126] 中的 |y| ,最大 ulp 误差为 2。

__expf(x)

The maximum ulp error is 2 + floor(abs(1.173 * x)).
最大 ulp 误差为 2 + floor(abs(1.173 * x))

__exp10f(x)

The maximum ulp error is 2 + floor(abs(2.97 * x)).
最大 ulp 误差为 2 + floor(abs(2.97 * x))

__logf(x)

For x in [0.5, 2], the maximum absolute error is 2^-21.41, otherwise, the maximum ulp error is 3.
对于 x 在[0.5, 2]范围内,最大绝对误差为 2^-21.41,否则,最大 ulp 误差为 3。

__log2f(x)

For x in [0.5, 2], the maximum absolute error is 2^-22, otherwise, the maximum ulp error is 2.
对于 x 在[0.5, 2]范围内,最大绝对误差为 2^-22,否则,最大 ulp 误差为 2。

__log10f(x)

For x in [0.5, 2], the maximum absolute error is 2^-24, otherwise, the maximum ulp error is 3.
对于 x 在[0.5, 2]范围内,最大绝对误差为 2^-24,否则,最大 ulp 误差为 3。

__sinf(x)

For x in [-π, π], the maximum absolute error is 2^-21.41, and larger otherwise.
对于 [-π, π] 中的 x ,最大绝对误差为 2^-21.41,否则更大。

__cosf(x)

For x in [-π, π], the maximum absolute error is 2^-21.19, and larger otherwise.
对于 [-π, π] 中的 x ,最大绝对误差为 2^-21.19,否则更大。

__sincosf(x,sptr,cptr)

Same as __sinf(x) and __cosf(x).
__sinf(x)__cosf(x) 相同。

__tanf(x)

Derived from its implementation as __sinf(x) * (1/__cosf(x)).
从其实现派生为 __sinf(x) * (1/__cosf(x))

__powf(x, y)

Derived from its implementation as exp2f(y * __log2f(x)).
从其实现派生为 exp2f(y * __log2f(x))

Double-Precision Floating-Point Functions
双精度浮点函数

__dadd_rn() and __dmul_rn() map to addition and multiplication operations that the compiler never merges into FMADs. By contrast, additions and multiplications generated from the ‘*’ and ‘+’ operators will frequently be combined into FMADs.
__dadd_rn()__dmul_rn() 分别映射到编译器不会合并为 FMADs 的加法和乘法操作。相比之下,从 '*' 和 '+' 运算符生成的加法和乘法将经常被合并为 FMADs。

Table 17 Double-Precision Floating-Point Intrinsic Functions. (Supported by the CUDA Runtime Library with Respective Error Bounds)
表 17 双精度浮点内置函数。 (由 CUDA 运行时库支持,具有相应的误差界限) 

Function 功能

Error bounds 误差界限

__dadd_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__dsub_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__dmul_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

__fma_[rn,rz,ru,rd](x,y,z)

IEEE-compliant. 符合 IEEE 标准。

__ddiv_[rn,rz,ru,rd](x,y)

IEEE-compliant. 符合 IEEE 标准。

Requires compute capability > 2.
需要计算能力> 2。

__drcp_[rn,rz,ru,rd](x)

IEEE-compliant. 符合 IEEE 标准。

Requires compute capability > 2.
需要计算能力> 2。

__dsqrt_[rn,rz,ru,rd](x)

IEEE-compliant. 符合 IEEE 标准。

Requires compute capability > 2.
需要计算能力> 2。

14. C++ Language Support
14. C++ 语言支持 

As described in Compilation with NVCC, CUDA source files compiled with nvcc can include a mix of host code and device code. The CUDA front-end compiler aims to emulate the host compiler behavior with respect to C++ input code. The input source code is processed according to the C++ ISO/IEC 14882:2003, C++ ISO/IEC 14882:2011, C++ ISO/IEC 14882:2014 or C++ ISO/IEC 14882:2017 specifications, and the CUDA front-end compiler aims to emulate any host compiler divergences from the ISO specification. In addition, the supported language is extended with CUDA-specific constructs described in this document, and is subject to the restrictions described below.
如《使用 NVCC 进行编译》中所述,使用 nvcc 编译的 CUDA 源文件可以包含主机代码和设备代码的混合。CUDA 前端编译器旨在模拟主机编译器对于 C++ 输入代码的行为。输入源代码根据 C++ ISO/IEC 14882:2003、C++ ISO/IEC 14882:2011、C++ ISO/IEC 14882:2014 或 C++ ISO/IEC 14882:2017 规范进行处理,CUDA 前端编译器旨在模拟主机编译器与 ISO 规范的任何偏差。此外,支持的语言通过本文档中描述的 CUDA 特定构造进行扩展,并受到以下描述的限制。

C++11 Language Features, C++14 Language Features, C++17 Language Features and C++20 Language Features provide support matrices for the C++11, C++14, C++17 and C++20 features, respectively. Restrictions lists the language restrictions. Polymorphic Function Wrappers and Extended Lambdas describe additional features. Code Samples gives code samples.
C++11 语言特性、C++14 语言特性、C++17 语言特性和 C++20 语言特性分别提供了对 C++11、C++14、C++17 和 C++20 特性的支持矩阵。限制列出了语言限制。多态函数包装器和扩展 Lambda 描述了额外的功能。代码示例提供了代码示例。

14.1. C++11 Language Features
14.1. C++11 语言特性 

The following table lists new language features that have been accepted into the C++11 standard. The “Proposal” column provides a link to the ISO C++ committee proposal that describes the feature, while the “Available in nvcc (device code)” column indicates the first version of nvcc that contains an implementation of this feature (if it has been implemented) for device code.
下表列出了已被纳入 C++11 标准的新语言特性。 “提案”列提供了一个链接,指向描述该特性的 ISO C++委员会提案,而“在 nvcc 中可用(设备代码)”列则指示了包含此特性实现的第一个 nvcc 版本(如果已实现)用于设备代码。

Table 18 C++11 Language Features
表 18 C++11 语言特性 

Language Feature | C++11 Proposal | Available in nvcc (device code)
Rvalue references | N2118 | 7.0
Rvalue references for *this | N2439 | 7.0
Initialization of class objects by rvalues | N1610 | 7.0
Non-static data member initializers | N2756 | 7.0
Variadic templates | N2242 | 7.0
Extending variadic template template parameters | N2555 | 7.0
Initializer lists | N2672 | 7.0
Static assertions | N1720 | 7.0
auto-typed variables | N1984 | 7.0
    Multi-declarator auto | N1737 | 7.0
    Removal of auto as a storage-class specifier | N2546 | 7.0
    New function declarator syntax | N2541 | 7.0
Lambda expressions | N2927 | 7.0
Declared type of an expression | N2343 | 7.0
    Incomplete return types | N3276 | 7.0
Right angle brackets | N1757 | 7.0
Default template arguments for function templates | DR226 | 7.0
Solving the SFINAE problem for expressions | DR339 | 7.0
Alias templates | N2258 | 7.0
Extern templates | N1987 | 7.0
Null pointer constant | N2431 | 7.0
Strongly-typed enums | N2347 | 7.0
Forward declarations for enums | N2764, DR1206 | 7.0
Standardized attribute syntax | N2761 | 7.0
Generalized constant expressions | N2235 | 7.0
Alignment support | N2341 | 7.0
Conditionally-supported behavior | N1627 | 7.0
Changing undefined behavior into diagnosable errors | N1727 | 7.0
Delegating constructors | N1986 | 7.0
Inheriting constructors | N2540 | 7.0
Explicit conversion operators | N2437 | 7.0
New character types | N2249 | 7.0
Unicode string literals | N2442 | 7.0
Raw string literals | N2442 | 7.0
Universal character names in literals | N2170 | 7.0
User-defined literals | N2765 | 7.0
Standard Layout Types | N2342 | 7.0
Defaulted functions | N2346 | 7.0
Deleted functions | N2346 | 7.0
Extended friend declarations | N1791 | 7.0
Extending sizeof | N2253, DR850 | 7.0
Inline namespaces | N2535 | 7.0
Unrestricted unions | N2544 | 7.0
Local and unnamed types as template arguments | N2657 | 7.0
Range-based for | N2930 | 7.0
Explicit virtual overrides | N2928, N3206, N3272 | 7.0
Minimal support for garbage collection and reachability-based leak detection | N2670 | N/A (see Restrictions)
Allowing move constructors to throw [noexcept] | N3050 | 7.0
Defining move special member functions | N3053 | 7.0

Concurrency:

Sequence points | N2239 |
Atomic operations | N2427 |
Strong Compare and Exchange | N2748 |
Bidirectional Fences | N2752 |
Memory model | N2429 |
Data-dependency ordering: atomics and memory model | N2664 |
Propagating exceptions | N2179 |
Allow atomics use in signal handlers | N2547 |
Thread-local storage | N2659 |
Dynamic initialization and destruction with concurrency | N2660 |

C99 Features in C++11:

__func__ predefined identifier | N2340 | 7.0
C99 preprocessor | N1653 | 7.0
long long | N1811 | 7.0
Extended integral types | N1988 |

14.2. C++14 Language Features

The following table lists new language features that have been accepted into the C++14 standard.

Table 19 C++14 Language Features

Language Feature | C++14 Proposal | Available in nvcc (device code)
Tweak to certain C++ contextual conversions | N3323 | 9.0
Binary literals | N3472 | 9.0
Functions with deduced return type | N3638 | 9.0
Generalized lambda capture (init-capture) | N3648 | 9.0
Generic (polymorphic) lambda expressions | N3649 | 9.0
Variable templates | N3651 | 9.0
Relaxing requirements on constexpr functions | N3652 | 9.0
Member initializers and aggregates | N3653 | 9.0
Clarifying memory allocation | N3664 |
Sized deallocation | N3778 |
[[deprecated]] attribute | N3760 | 9.0
Single-quotation-mark as a digit separator | N3781 | 9.0

14.3. C++17 Language Features

All C++17 language features are supported in nvcc version 11.0 and later, subject to restrictions described here.

14.4. C++20 Language Features

All C++20 language features are supported in nvcc version 12.0 and later, subject to restrictions described here.

14.5. Restrictions

14.5.1. Host Compiler Extensions

Host compiler specific language extensions are not supported in device code.

__Complex types are only supported in host code.

__int128 type is supported in device code when compiled in conjunction with a host compiler that supports it.

__float128 type is only supported in host code on 64-bit x86 Linux platforms. A constant expression of __float128 type may be processed by the compiler in a floating point representation with lower precision.
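
As an illustration, the following sketch uses __int128 in device code; it assumes a 64-bit Linux host compiler (such as gcc or clang) that supports the type, and the identifiers are purely illustrative:

```cuda
#include <cstdio>

// Minimal sketch: __int128 arithmetic in device code. Requires a host
// compiler that supports __int128 (e.g. gcc/clang on 64-bit Linux).
__global__ void mul128(long long a, long long b)
{
    __int128 p = static_cast<__int128>(a) * b;  // 128-bit product on the device
    // printf has no __int128 conversion, so print the two 64-bit halves.
    printf("hi=%lld lo=%llu\n",
           (long long)(p >> 64), (unsigned long long)p);
}

int main()
{
    mul128<<<1, 1>>>(0x100000000LL, 0x100000000LL);
    cudaDeviceSynchronize();
    return 0;
}
```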

14.5.2. Preprocessor Symbols

14.5.2.1. __CUDA_ARCH__

  1. The type signature of the following entities shall not depend on whether __CUDA_ARCH__ is defined or not, or on a particular value of __CUDA_ARCH__:

    • __global__ functions and function templates

    • __device__ and __constant__ variables

    • textures and surfaces

    Example:

    #if !defined(__CUDA_ARCH__)
    typedef int mytype;
    #else
    typedef double mytype;
    #endif
    
    __device__ mytype xxx;         // error: xxx's type depends on __CUDA_ARCH__
    __global__ void foo(mytype in, // error: foo's type depends on __CUDA_ARCH__
                        mytype *ptr)
    {
      *ptr = in;
    }
    
  2. If a __global__ function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether __CUDA_ARCH__ is defined and regardless of the value of __CUDA_ARCH__.

    Example:

    __device__ int result;
    template <typename T>
    __global__ void kern(T in)
    {
      result = in;
    }
    
    __host__ __device__ void foo(void)
    {
    #if !defined(__CUDA_ARCH__)
      kern<<<1,1>>>(1);      // error: "kern<int>" instantiation only
                             // when __CUDA_ARCH__ is undefined!
    #endif
    }
    
    int main(void)
    {
      foo();
      cudaDeviceSynchronize();
      return 0;
    }
    
  3. In separate compilation mode, the presence or absence of a definition of a function or variable with external linkage shall not depend on whether __CUDA_ARCH__ is defined or on a particular value of __CUDA_ARCH__.

    Example:

    #if !defined(__CUDA_ARCH__)
    void foo(void) { }                  // error: The definition of foo()
                                        // is only present when __CUDA_ARCH__
                                        // is undefined
    #endif
    
  4. In separate compilation, __CUDA_ARCH__ must not be used in headers in a way that allows different objects to contain different behavior. Alternatively, it must be guaranteed that all objects will be compiled for the same compute_arch. If a weak function or template function is defined in a header and its behavior depends on __CUDA_ARCH__, then the instances of that function in the objects could conflict if the objects are compiled for different compute arch.

    For example, if a.h contains:

    template<typename T>
    __device__ T* getptr(void)
    {
    #if __CUDA_ARCH__ == 700
      return NULL; /* no address */
    #else
      __shared__ T arr[256];
      return arr;
    #endif
    }
    

    Then if a.cu and b.cu both include a.h and instantiate getptr for the same type, b.cu expects a non-NULL address, and the files are compiled with:

    nvcc -arch=compute_70 -dc a.cu
    nvcc -arch=compute_80 -dc b.cu
    nvcc -arch=sm_80 a.o b.o
    

    At link time only one version of the getptr is used, so the behavior would depend on which version is chosen. To avoid this, either a.cu and b.cu must be compiled for the same compute arch, or __CUDA_ARCH__ should not be used in the shared header function.

The compiler does not guarantee that a diagnostic will be generated for the unsupported uses of __CUDA_ARCH__ described above.

14.5.3. Qualifiers

14.5.3.1. Device Memory Space Specifiers

The __device__, __shared__, __managed__ and __constant__ memory space specifiers are not allowed on:

  • class, struct, and union data members,

  • formal parameters,

  • non-extern variable declarations within a function that executes on the host.

The __device__, __constant__ and __managed__ memory space specifiers are not allowed on variable declarations that are neither extern nor static within a function that executes on the device.

A __device__, __constant__, __managed__ or __shared__ variable definition cannot have a class type with a non-empty constructor or a non-empty destructor. A constructor for a class type is considered empty at a point in the translation unit, if it is either a trivial constructor or it satisfies all of the following conditions:

  • The constructor function has been defined.

  • The constructor function has no parameters, the initializer list is empty and the function body is an empty compound statement.

  • Its class has no virtual functions, no virtual base classes and no non-static data member initializers.

  • The default constructors of all base classes of its class can be considered empty.

  • For all the nonstatic data members of its class that are of class type (or array thereof), the default constructors can be considered empty.

A destructor for a class is considered empty at a point in the translation unit, if it is either a trivial destructor or it satisfies all of the following conditions:

  • The destructor function has been defined.

  • The destructor function body is an empty compound statement.

  • Its class has no virtual functions and no virtual base classes.

  • The destructors of all base classes of its class can be considered empty.

  • For all the nonstatic data members of its class that are of class type (or array thereof), the destructor can be considered empty.

When compiling in the whole program compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, __managed__ and __constant__ variables cannot be defined as external using the extern keyword. The only exception is for dynamically allocated __shared__ variables as described in __shared__.

When compiling in the separate compilation mode (see the nvcc user manual for a description of this mode), __device__, __shared__, __managed__ and __constant__ variables can be defined as external using the extern keyword. nvlink will generate an error when it cannot find a definition for an external variable (unless it is a dynamically allocated __shared__ variable).
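
To make the placement rules concrete, here is a hedged sketch (identifiers are illustrative; the commented-out lines show declarations the compiler rejects, and exact diagnostics vary by compiler version):

```cuda
__device__ int ok;                  // OK: namespace-scope __device__ variable

struct S {
    // __device__ int member;      // not allowed: data member with a
                                   // memory space specifier
};

// void f(__constant__ int p);     // not allowed: formal parameter with a
                                   // memory space specifier

void host_fn()                      // executes on the host
{
    // __device__ int local;       // not allowed: non-extern declaration
                                   // in a host function
}

__device__ void device_fn()         // executes on the device
{
    static __device__ int s;        // OK: static declaration in device code
    __shared__ int tile[32];        // OK: __shared__ is allowed here
    // __device__ int local;       // not allowed: neither extern nor static
    (void)s; (void)tile;
}
```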

14.5.3.2. __managed__ Memory Space Specifier

Variables marked with the __managed__ memory space specifier (“managed” variables) have the following restrictions:

  • The address of a managed variable is not a constant expression.

  • A managed variable shall not have a const qualified type.

  • A managed variable shall not have a reference type.

  • The address or value of a managed variable shall not be used when the CUDA runtime may not be in a valid state, including the following cases:

    • In static/dynamic initialization or destruction of an object with static or thread local storage duration.

    • In code that executes after exit() has been called (for example, a function marked with gcc’s “__attribute__((destructor))”).

    • In code that executes when the CUDA runtime may not be initialized (for example, a function marked with gcc’s “__attribute__((constructor))”).

  • A managed variable cannot be used as an unparenthesized id-expression argument to a decltype() expression.

  • Managed variables have the same coherence and consistency behavior as specified for dynamically allocated managed memory.

  • When a CUDA program containing managed variables is run on an execution platform with multiple GPUs, the variables are allocated only once, and not per GPU.

  • A managed variable declaration without the extern linkage is not allowed within a function that executes on the host.

  • A managed variable declaration without the extern or static linkage is not allowed within a function that executes on the device.

Here are examples of legal and illegal uses of managed variables:

__device__ __managed__ int xxx = 10;         // OK

int *ptr = &xxx;                             // error: use of managed variable
                                             // (xxx) in static initialization
struct S1_t {
  int field;
  S1_t(void) : field(xxx) { };
};
struct S2_t {
  ~S2_t(void) { xxx = 10; }
};

S1_t temp1;                                 // error: use of managed variable
                                            // (xxx) in dynamic initialization

S2_t temp2;                                 // error: use of managed variable
                                            // (xxx) in the destructor of
                                            // object with static storage
                                            // duration

__device__ __managed__ const int yyy = 10;  // error: const qualified type

__device__ __managed__ int &zzz = xxx;      // error: reference type

template <int *addr> struct S3_t { };
S3_t<&xxx> temp;                            // error: address of managed
                                            // variable(xxx) not a
                                            // constant expression

__global__ void kern(int *ptr)
{
  assert(ptr == &xxx);                      // OK
  xxx = 20;                                 // OK
}
int main(void)
{
  int *ptr = &xxx;                          // OK
  kern<<<1,1>>>(ptr);
  cudaDeviceSynchronize();
  xxx++;                                    // OK
  decltype(xxx) qqq;                        // error: managed variable(xxx) used
                                            // as unparenthesized argument to
                                            // decltype

  decltype((xxx)) zzz = yyy;                // OK
}

14.5.3.3. Volatile Qualifier

The compiler is free to optimize reads and writes to global or shared memory (for example, by caching global reads into registers or L1 cache) as long as it respects the memory ordering semantics of memory fence functions (Memory Fence Functions) and memory visibility semantics of synchronization functions (Synchronization Functions).

These optimizations can be disabled using the volatile keyword: If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.
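
As a sketch of the difference, a flag polled in a spin loop should be declared volatile so that each iteration issues a real load; this is an assumed example (a production version would need additional care around ordering and forward progress):

```cuda
__device__ volatile int flag = 0;   // volatile: every access is a real memory op
__device__ int payload;

__global__ void producer()
{
    payload = 42;
    __threadfence();                // order the payload write before the flag write
    flag = 1;
}

__global__ void consumer(int *out)
{
    while (flag == 0) { }           // without volatile, this load could be
                                    // cached in a register and spin forever
    __threadfence();
    *out = payload;
}
```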

14.5.4. Pointers

Dereferencing a pointer either to global or shared memory in code that is executed on the host, or to host memory in code that is executed on the device results in an undefined behavior, most often in a segmentation fault and application termination.

The address obtained by taking the address of a __device__, __shared__ or __constant__ variable can only be used in device code. The address of a __device__ or __constant__ variable obtained through cudaGetSymbolAddress() as described in Device Memory can only be used in host code.
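
A minimal sketch of the two kinds of addresses, device-side via the & operator and host-side via cudaGetSymbolAddress():

```cuda
__device__ int data;

__global__ void kern()
{
    int *p = &data;        // OK: address taken in device code, used in device code
    *p = 7;
}

int main()
{
    kern<<<1, 1>>>();

    void *devPtr = nullptr;
    // OK: host-usable address of a __device__ variable
    cudaGetSymbolAddress(&devPtr, data);

    int value = 0;
    cudaMemcpy(&value, devPtr, sizeof(value), cudaMemcpyDeviceToHost);

    // int *bad = &data;   // not allowed: &data taken in host code names the
                           // device object and must not be used here
    return 0;
}
```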

14.5.5. Operators

14.5.5.1. Assignment Operator

__constant__ variables can only be assigned from the host code through runtime functions (Device Memory); they cannot be assigned from the device code.

__shared__ variables cannot have an initialization as part of their declaration.

It is not allowed to assign values to any of the built-in variables defined in Built-in Variables.
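
For example, a __constant__ variable is written from host code with the runtime function cudaMemcpyToSymbol() (a hedged sketch; identifiers are illustrative):

```cuda
__constant__ float coeffs[4];

__global__ void apply(float *v)
{
    // OK: device code may read a __constant__ variable...
    *v = coeffs[0] * (*v);
    // coeffs[0] = 1.0f;            // ...but may not assign to it
}

int main()
{
    const float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
    // OK: assignment from host code through a runtime function
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    float *d_v;
    cudaMalloc(&d_v, sizeof(float));
    apply<<<1, 1>>>(d_v);
    cudaDeviceSynchronize();
    cudaFree(d_v);
    return 0;
}
```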

14.5.5.2. Address Operator

It is not allowed to take the address of any of the built-in variables defined in Built-in Variables.

14.5.6. Run Time Type Information (RTTI)

The following RTTI-related features are supported in host code, but not in device code.

  • typeid operator

  • std::type_info

  • dynamic_cast operator
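
A sketch of what this restriction means in practice (the commented-out lines show device-code uses the compiler rejects; exact diagnostics are compiler-dependent):

```cuda
#include <typeinfo>

struct Base { virtual ~Base() { } };
struct Derived : Base { };

void host_fn(Base *b)
{
    const std::type_info &t = typeid(*b);    // OK in host code
    Derived *d = dynamic_cast<Derived *>(b); // OK in host code
    (void)t; (void)d;
}

__device__ void device_fn(Base *b)
{
    // const std::type_info &t = typeid(*b);    // error: typeid is not
    //                                          // supported in device code
    // Derived *d = dynamic_cast<Derived *>(b); // error: dynamic_cast is not
    //                                          // supported in device code
    (void)b;
}
```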

14.5.7. Exception Handling

Exception handling is only supported in host code, but not in device code.

Exception specification is not supported for __global__ functions.

14.5.8. Standard Library

Standard libraries are only supported in host code, but not in device code, unless specified otherwise.

14.5.9. Namespace Reservations

Unless an exception is otherwise noted, it is undefined behavior to add any declarations or definitions to cuda::, nv::, cooperative_groups:: or any namespace nested within.

Examples:

namespace cuda{
   // Bad: class declaration added to namespace cuda
   struct foo{};

   // Bad: function definition added to namespace cuda
   cudaStream_t make_stream(){
      cudaStream_t s;
      cudaStreamCreate(&s);
      return s;
   }
} // namespace cuda

namespace cuda{
   namespace utils{
      // Bad: function definition added to namespace nested within cuda
      cudaStream_t make_stream(){
          cudaStream_t s;
          cudaStreamCreate(&s);
          return s;
      }
   } // namespace utils
} // namespace cuda

namespace utils{
   namespace cuda{
     // Okay: namespace cuda may be used nested within a non-reserved namespace
     cudaStream_t make_stream(){
          cudaStream_t s;
          cudaStreamCreate(&s);
          return s;
      }
   } // namespace cuda
} // namespace utils

// Bad: Equivalent to adding symbols to namespace cuda at global scope
using namespace utils;

14.5.10. Functions

14.5.10.1. External Linkage

A call within device code to a function declared with the extern qualifier is only allowed if the function is defined within the same compilation unit as the device code, i.e., a single file or several files linked together with relocatable device code and nvlink.
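
A sketch of this with relocatable device code (the file names and identifiers are hypothetical):

```cuda
//--- a.cu (hypothetical file name) ---
extern __device__ int helper(int);   // declaration with external linkage

__global__ void kern(int *out) { *out = helper(10); }

//--- b.cu (hypothetical file name) ---
__device__ int helper(int x) { return x * 2; }

// Both files must be linked into the same compilation unit with
// relocatable device code, for example:
//   nvcc -rdc=true a.cu b.cu -o app
// Compiling a.cu alone in whole-program mode would fail at device link
// time, because no definition of helper() is visible.
```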

14.5.10.2. Implicitly-declared and explicitly-defaulted functions

Let F denote a function that is either implicitly-declared or is explicitly-defaulted on its first declaration. The execution space specifiers (__host__, __device__) for F are the union of the execution space specifiers of all the functions that invoke it (note that a __global__ caller will be treated as a __device__ caller for this analysis). For example:

class Base {
  int x;
public:
  __host__ __device__ Base(void) : x(10) {}
};

class Derived : public Base {
  int y;
};

class Other: public Base {
  int z;
};

__device__ void foo(void)
{
  Derived D1;
  Other D2;
}

__host__ void bar(void)
{
  Other D3;
}

Here, the implicitly-declared constructor function “Derived::Derived” will be treated as a __device__ function, since it is invoked only from the __device__ function “foo”. The implicitly-declared constructor function “Other::Other” will be treated as a __host__ __device__ function, since it is invoked both from a __device__ function “foo” and a __host__ function “bar”.

In addition, if F is a virtual destructor, then the execution spaces of each virtual destructor D overridden by F are added to the set of execution spaces for F, if D is either not implicitly defined or is explicitly defaulted on a declaration other than its first declaration.

For example:

struct Base1 { virtual __host__ __device__ ~Base1() { } };
struct Derived1 : Base1 { }; // implicitly-declared virtual destructor
                             // ~Derived1 has __host__ __device__
                             // execution space specifiers

struct Base2 { virtual __device__ ~Base2(); };
__device__ Base2::~Base2() = default;
struct Derived2 : Base2 { }; // implicitly-declared virtual destructor
                             // ~Derived2 has __device__ execution
                             // space specifiers

14.5.10.3. Function Parameters

__global__ function parameters are passed to the device via constant memory and are limited to 32,764 bytes starting with Volta, and 4 KB on older architectures.

__global__ functions cannot have a variable number of arguments.

__global__ function parameters cannot be pass-by-reference.

In separate compilation mode, if a __device__ or __global__ function is ODR-used in a particular translation unit, then the parameter and return types of the function must be complete in that translation unit.

Example:

//first.cu:
struct S;
__device__ void foo(S); // error: type 'S' is incomplete
__device__ auto *ptr = foo;

int main() { }

//second.cu:
struct S { int x; };
__device__ void foo(S) { }
//compiler invocation
$ nvcc -std=c++14 -rdc=true first.cu second.cu -o first
nvlink error   : Prototype doesn't match for '_Z3foo1S' in '/tmp/tmpxft_00005c8c_00000000-18_second.o', first defined in '/tmp/tmpxft_00005c8c_00000000-18_second.o'
nvlink fatal   : merge_elf failed
14.5.10.3.1. __global__ Function Argument Processing

When a __global__ function is launched from device code, each argument must be trivially copyable and trivially destructible.

When a __global__ function is launched from host code, each argument type is allowed to be non-trivially copyable or non-trivially-destructible, but the processing for such types does not follow the standard C++ model, as described below. User code must ensure that this workflow does not affect program correctness. The workflow diverges from standard C++ in two areas:

  1. Memcpy instead of copy constructor invocation

    When lowering a __global__ function launch from host code, the compiler generates stub functions that copy the parameters one or more times by value, before eventually using memcpy to copy the arguments to the __global__ function’s parameter memory on the device. This occurs even if an argument was non-trivially-copyable, and therefore may break programs where the copy constructor has side effects.

    Example:

    #include <cassert>
    struct S {
     int x;
     int *ptr;
     __host__ __device__ S() { }
     __host__ __device__ S(const S &) { ptr = &x; }
    };
    
    __global__ void foo(S in) {
     // this assert may fail, because the compiler
     // generated code will memcpy the contents of "in"
     // from host to kernel parameter memory, so the
     // "in.ptr" is not initialized to "&in.x" because
     // the copy constructor is skipped.
     assert(in.ptr == &in.x);
    }
    
    int main() {
      S tmp;
      foo<<<1,1>>>(tmp);
      cudaDeviceSynchronize();
    }
    

    Example: 示例:

    #include <cassert>

    __managed__ int counter;
    struct S1 {
      S1() { }
      S1(const S1 &) { ++counter; }
    };

    __global__ void foo(S1) {
      /* this assertion may fail, because
         the compiler generates stub
         functions on the host for a kernel
         launch, and they may copy the
         argument by value more than once.
      */
      assert(counter == 1);
    }

    int main() {
      S1 V;
      foo<<<1,1>>>(V);
      cudaDeviceSynchronize();
    }
    
  2. Destructor may be invoked before the __global__ function has finished

    Kernel launches are asynchronous with host execution. As a result, if a __global__ function argument has a non-trivial destructor, the destructor may execute in host code even before the __global__ function has finished execution. This may break programs where the destructor has side effects.

    Example:

    struct S {
     int *ptr;
     S() : ptr(nullptr) { }
     S(const S &) { cudaMallocManaged(&ptr, sizeof(int)); }
     ~S() { cudaFree(ptr); }
    };
    
    __global__ void foo(S in) {
    
      //error: This store may write to memory that has already been
      //       freed (see below).
      *(in.ptr) = 4;
    
    }
    
    int main() {
     S V;
    
     /* The object 'V' is first copied by value to a compiler-generated
      * stub function that does the kernel launch, and the stub function
      * bitwise copies the contents of the argument to kernel parameter
      * memory.
      * However, GPU kernel execution is asynchronous with host execution.
      * As a result, S::~S() will execute when the stub function returns,
      * releasing allocated memory, even though the kernel may not have
      * finished execution.
      */
     foo<<<1,1>>>(V);
     cudaDeviceSynchronize();
    }
    
14.5.10.3.2. Toolkit and Driver Compatibility

Developers must use the 12.1 Toolkit and r530 driver or higher to compile, launch, and debug kernels that accept parameters larger than 4KB. If such kernels are launched on older drivers, CUDA will issue the error CUDA_ERROR_NOT_SUPPORTED.

14.5.10.4. Static Variables within Function

Variable memory space specifiers are allowed in the declaration of a static variable V within the immediate or nested block scope of a function F where:

  • F is a __global__ or __device__-only function.

  • F is a __host__ __device__ function and __CUDA_ARCH__ is defined.

If no explicit memory space specifier is present in the declaration of V, an implicit __device__ specifier is assumed during device compilation.

V has the same initialization restrictions as a variable with the same memory space specifiers declared in namespace scope; for example, a __device__ variable cannot have a ‘non-empty’ constructor (see Device Memory Space Specifiers).

Examples of legal and illegal uses of function-scope static variables are shown below.

struct S1_t {
  int x;
};

struct S2_t {
  int x;
  __device__ S2_t(void) { x = 10; }
};

struct S3_t {
  int x;
  __device__ S3_t(int p) : x(p) { }
};

__device__ void f1() {
  static int i1;              // OK, implicit __device__ memory space specifier
  static int i2 = 11;         // OK, implicit __device__ memory space specifier
  static __managed__ int m1;  // OK
  static __device__ int d1;   // OK
  static __constant__ int c1; // OK

  static S1_t i3;             // OK, implicit __device__ memory space specifier
  static S1_t i4 = {22};      // OK, implicit __device__ memory space specifier

  static __shared__ int i5;   // OK

  int x = 33;
  static int i6 = x;          // error: dynamic initialization is not allowed
  static S1_t i7 = {x};       // error: dynamic initialization is not allowed

  static S2_t i8;             // error: dynamic initialization is not allowed
  static S3_t i9(44);         // error: dynamic initialization is not allowed
}

__host__ __device__ void f2() {
  static int i1;              // OK, implicit __device__ memory space specifier
                              // during device compilation.
#ifdef __CUDA_ARCH__
  static __device__ int d1;   // OK, declaration is only visible during device
                              // compilation  (__CUDA_ARCH__ is defined)
#else
  static int d0;              // OK, declaration is only visible during host
                              // compilation (__CUDA_ARCH__ is not defined)
#endif

  static __device__ int d2;   // error: __device__ variable inside
                              // a host function during host compilation
                              // i.e. when __CUDA_ARCH__ is not defined

  static __shared__ int i2;  // error: __shared__ variable inside
                             // a host function during host compilation
                             // i.e. when __CUDA_ARCH__ is not defined
}

14.5.10.5. Function Pointers
14.5.10.5. 函数指针 

The address of a __global__ function taken in host code cannot be used in device code (e.g. to launch the kernel). Similarly, the address of a __global__ function taken in device code cannot be used in host code.
在主机代码中获取的 __global__ 函数地址不能在设备代码中使用(例如,用于启动内核)。同样,在设备代码中获取的 __global__ 函数地址不能在主机代码中使用。

It is not allowed to take the address of a __device__ function in host code.
在主机代码中不允许获取 __device__ 函数的地址。
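For illustration, the following sketch (not part of the original guide; function names are hypothetical) shows the legal and illegal uses described above:
作为说明,下面的示例草图(非指南原文;函数名为假设)展示了上述合法与非法用法:

```cpp
__global__ void kern(void) { }
__device__ int dfunc(int x) { return x; }

__device__ void device_fn(void) {
  int (*fp)(int) = dfunc;      // OK: address of a __device__ function
  fp(1);                       //     taken and used in device code
}

int main(void) {
  void (*kp)(void) = kern;     // OK: address of a __global__ function
  kp<<<1,1>>>();               //     taken in host code may be used to
                               //     launch the kernel from host code
  // int (*dp)(int) = dfunc;   // error: address of a __device__
                               //        function taken in host code
  return 0;
}
```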

14.5.10.6. Function Recursion
14.5.10.6. 函数递归 

__global__ functions do not support recursion.
__global__ 函数不支持递归。
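By contrast, a __device__ function may call itself. A minimal sketch (hypothetical names, not from the original guide):
相比之下, __device__ 函数可以调用自身。一个最小示例草图(名称为假设,非指南原文):

```cpp
__device__ int fact(int n) {              // OK: a __device__ function
  return (n <= 1) ? 1 : n * fact(n - 1);  //     may be recursive
}

__global__ void kern(int *out, int n) {
  *out = fact(n);    // OK: calling a recursive __device__ function;
                     // the __global__ function itself must not recurse
}
```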

14.5.10.7. Friend Functions
14.5.10.7. 友元函数 

A __global__ function or function template cannot be defined in a friend declaration.
在友元声明中不能定义 __global__ 函数或函数模板。

Example: 示例:

struct S1_t {
  friend __global__
  void foo1(void);  // OK: not a definition
  template<typename T>
  friend __global__
  void foo2(void); // OK: not a definition

  friend __global__
  void foo3(void) { } // error: definition in friend declaration

  template<typename T>
  friend __global__
  void foo4(void) { } // error: definition in friend declaration
};

14.5.10.8. Operator Function
14.5.10.8. 操作符函数 

An operator function cannot be a __global__ function.
运算符函数不能是 __global__ 函数。

14.5.10.9. Allocation and Deallocation Functions
14.5.10.9. 分配和释放函数 

A user-defined operator new, operator new[], operator delete, or operator delete[] cannot be used to replace the corresponding __host__ or __device__ builtins provided by the compiler.
用户定义的 operator new、operator new[]、operator delete 或 operator delete[] 不能用来替换编译器提供的相应 __host__ 或 __device__ 内置函数。

14.5.11. Classes 14.5.11. 类 

14.5.11.1. Data Members
14.5.11.1. 数据成员 

Static data members are not supported except for those that are also const-qualified (see Const-qualified variables).
静态数据成员不受支持,除非它们也被 const 修饰(请参阅 const 修饰的变量)。
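A minimal sketch of this restriction (hypothetical names, not from the original guide):
此限制的最小示例草图(名称为假设,非指南原文):

```cpp
struct S_t {
  static const int x = 10;   // OK: const-qualified static data member
  // static int y;           // error: non-const static data member
                             //        is not supported
};

__device__ int get_x(void) {
  return S_t::x;             // OK: see Const-qualified variables
}
```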

14.5.11.2. Function Members
14.5.11.2. 函数成员 

Static member functions cannot be __global__ functions.
静态成员函数不能是 __global__ 函数。
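A minimal sketch of this restriction (hypothetical names, not from the original guide):
此限制的最小示例草图(名称为假设,非指南原文):

```cpp
struct W_t {
  static __device__ int twice(int v) { return 2 * v; }  // OK
  // static __global__ void k(void) { }   // error: a static member
                                          //   function cannot be __global__
};

__global__ void kern(int *out) {
  *out = W_t::twice(21);
}
```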

14.5.11.3. Virtual Functions
14.5.11.3. 虚函数 

When a function in a derived class overrides a virtual function in a base class, the execution space specifiers (i.e., __host__, __device__) on the overridden and overriding functions must match.
当派生类中的函数覆盖基类中的虚函数时,覆盖和被覆盖函数的执行空间修饰符(即, __host____device__ )必须匹配。
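A minimal sketch of the matching requirement (hypothetical names, not from the original guide):
匹配要求的最小示例草图(名称为假设,非指南原文):

```cpp
struct Base {
  virtual __host__ __device__ void foo(void) { }
};

struct Derived : Base {
  __host__ __device__ void foo(void) override { }  // OK: specifiers match

  // __device__ void foo(void) override { }  // error: execution space
                                             // specifiers do not match the
                                             // overridden virtual function
};
```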

It is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions.
不允许将带有虚函数的类的对象作为参数传递给 __global__ 函数。

If an object is created in host code, invoking a virtual function for that object in device code has undefined behavior.
如果在主机代码中创建了一个对象,则在设备代码中调用该对象的虚函数会导致未定义行为。

If an object is created in device code, invoking a virtual function for that object in host code has undefined behavior.
如果在设备代码中创建了一个对象,则在主机代码中调用该对象的虚函数会导致未定义的行为。

See Windows-Specific for additional constraints when using the Microsoft host compiler.
请参阅 Windows-Specific 以获取在使用 Microsoft 主机编译器时的额外约束条件。

Example: 示例:

struct S1 { virtual __host__ __device__ void foo() { } };

__managed__ S1 *ptr1, *ptr2;

__managed__ __align__(16) char buf1[128];
__global__ void kern() {
  ptr1->foo();     // error: virtual function call on an object
                   //        created in host code.
  ptr2 = new(buf1) S1();
}

int main(void) {
  void *buf;
  cudaMallocManaged(&buf, sizeof(S1), cudaMemAttachGlobal);
  ptr1 = new (buf) S1();
  kern<<<1,1>>>();
  cudaDeviceSynchronize();
  ptr2->foo();  // error: virtual function call on an object
                //        created in device code.
}

14.5.11.4. Virtual Base Classes
14.5.11.4. 虚基类 

It is not allowed to pass as an argument to a __global__ function an object of a class derived from virtual base classes.
不允许将从虚基类派生的类的对象作为参数传递给 __global__ 函数。

See Windows-Specific for additional constraints when using the Microsoft host compiler.
请参阅 Windows-Specific 以获取在使用 Microsoft 主机编译器时的额外约束条件。

14.5.11.5. Anonymous Unions
14.5.11.5. 匿名联合体 

Member variables of a namespace scope anonymous union cannot be referenced in a __global__ or __device__ function.
命名空间范围内的匿名联合体的成员变量不能在 __global____device__ 函数中引用。
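A minimal sketch of this restriction (hypothetical names, not from the original guide):
此限制的最小示例草图(名称为假设,非指南原文):

```cpp
namespace N1 {
  static union { int au_i; float au_f; };  // namespace scope anonymous union
}

int host_fn(void) {
  return N1::au_i;      // OK: referenced in host code
}

__device__ int device_fn(void) {
  // return N1::au_i;   // error: member of a namespace scope anonymous
                        //        union referenced in a __device__ function
  return 0;
}
```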

14.5.11.6. Windows-Specific
14.5.11.6. Windows 特定 

The CUDA compiler follows the IA64 ABI for class layout, while the Microsoft host compiler does not. Let T denote a pointer to member type, or a class type that satisfies any of the following conditions:
CUDA 编译器在类布局方面遵循 IA64 ABI,而 Microsoft 主机编译器则不遵循。令 T 表示指向成员的指针类型,或满足以下任一条件的类类型:

  • T has virtual functions.
    T 有虚函数。

  • T has a virtual base class.
    T 有一个虚基类。

  • T has multiple inheritance with more than one direct or indirect empty base class.
    T 具有多重继承,拥有一个以上的直接或间接空基类。

  • All direct and indirect base classes B of T are empty and the type of the first field F of T uses B in its definition, such that B is laid out at offset 0 in the definition of F.
    T 的所有直接和间接基类 B 均为空,并且 T 的第一个字段 F 的类型在其定义中使用了 B,使得 B 在 F 的定义中位于偏移 0 处。

Let C denote T or a class type that has T as a field type or as a base class type. The CUDA compiler may compute the class layout and size differently than the Microsoft host compiler for the type C.
C 表示 T 或具有 T 作为字段类型或基类类型的类类型。CUDA 编译器可能会计算类的布局和大小,与 Microsoft 主机编译器对于类型 C 可能不同。

As long as the type C is used exclusively in host or device code, the program should work correctly.
只要类型 C 在主机或设备代码中独占使用,程序应该能正常工作。

Passing an object of type C between host and device code has undefined behavior, for example, as an argument to a __global__ function or through cudaMemcpy*() calls.
在主机和设备代码之间传递类型为 C 的对象具有未定义的行为,例如作为 __global__ 函数的参数或通过 cudaMemcpy*() 调用。

Accessing an object of type C or any subobject in device code, or invoking a member function in device code, has undefined behavior if the object is created in host code.
在设备代码中访问类型为 C 的对象或任何子对象,或在设备代码中调用成员函数,如果对象是在主机代码中创建的,则具有未定义行为。

Accessing an object of type C or any subobject in host code, or invoking a member function in host code, has undefined behavior if the object is created in device code 18.
在主机代码中访问类型为 C 的对象或任何子对象,或在主机代码中调用成员函数,如果对象是在设备代码中创建的,则具有未定义行为。

14.5.12. Templates 14.5.12. 模板 

A type or template cannot be used in the type, non-type or template template argument of a __global__ function template instantiation or a __device__/__constant__ variable instantiation if any of the following holds:
如果满足以下任一条件,则类型或模板不能用作 __global__ 函数模板实例化或 __device__/__constant__ 变量实例化的类型、非类型或模板模板参数:

  • The type or template is defined within a __host__ or __host__ __device__ function.
    类型或模板在 __host__ 或 __host__ __device__ 函数中定义。

  • The type or template is a class member with private or protected access and its parent class is not defined within a __device__ or __global__ function.
    类型或模板是一个类成员,具有 privateprotected 访问权限,并且其父类未在 __device____global__ 函数中定义。

  • The type is unnamed. 类型未命名。

  • The type is compounded from any of the types above.
    类型是由上述任何类型中的任意一个组合而成。

Example: 示例:

template <typename T>
__global__ void myKernel(void) { }

class myClass {
private:
    struct inner_t { };
public:
    static void launch(void)
    {
       // error: inner_t is used in template argument
       // but it is private
       myKernel<inner_t><<<1,1>>>();
    }
};

// C++14 only
template <typename T> __device__ T d1;

template <typename T1, typename T2> __device__ T1 d2;

void fn() {
  struct S1_t { };
  // error (C++14 only): S1_t is local to the function fn
  d1<S1_t> = {};

  auto lam1 = [] { };
  // error (C++14 only): a closure type cannot be used for
  // instantiating a variable template
  d2<int, decltype(lam1)> = 10;
}

14.5.13. Trigraphs and Digraphs
14.5.13. 三字符和双字符 

Trigraphs are not supported on any platform. Digraphs are not supported on Windows.
三字符不受任何平台支持。双字符在 Windows 上不受支持。

14.5.14. Const-qualified variables
14.5.14. 常量限定的变量 

Let ‘V’ denote a namespace scope variable or a class static member variable that has const qualified type and does not have execution space annotations (for example, __device__, __constant__, __shared__). V is considered to be a host code variable.
让“V”表示一个具有 const 限定类型且没有执行空间注释(例如, __device__, __constant__, __shared__ )的命名空间范围变量或类静态成员变量。V 被视为主机代码变量。

The value of V may be directly used in device code, if
V 的值可以直接在设备代码中使用,如果

  • V has been initialized with a constant expression before the point of use,
    V 已在使用点之前用常量表达式初始化

  • the type of V is not volatile-qualified, and
    V 的类型不是易失性限定的,并且

  • it has one of the following types:
    它具有以下类型之一:

    • built-in floating point type except when the Microsoft compiler is used as the host compiler,
      内置浮点类型,除非使用 Microsoft 编译器作为主机编译器

    • built-in integral type. 内置的整数类型。

Device source code cannot contain a reference to V or take the address of V.
设备源代码不能包含对 V 的引用或获取 V 的地址。

Example: 示例:

const int xxx = 10;
struct S1_t {  static const int yyy = 20; };

extern const int zzz;
const float www = 5.0;
__device__ void foo(void) {
  int local1[xxx];          // OK
  int local2[S1_t::yyy];    // OK

  int val1 = xxx;           // OK

  int val2 = S1_t::yyy;     // OK

  int val3 = zzz;           // error: zzz not initialized with constant
                            // expression at the point of use.

  const int &val4 = xxx;    // error: reference to host variable
  const int *val5 = &xxx;   // error: address of host variable
  const float val6 = www;   // OK except when the Microsoft compiler is used as
                            // the host compiler.
}
const int zzz = 20;

14.5.15. Long Double
14.5.15. 长双精度 

The use of long double type is not supported in device code.
设备代码中不支持使用 long double 类型。

14.5.16. Deprecation Annotation
14.5.16. 弃用注解 

nvcc supports the use of deprecated attribute when using gcc, clang, xlC, icc or pgcc host compilers, and the use of deprecated declspec when using the cl.exe host compiler. It also supports the [[deprecated]] standard attribute when the C++14 dialect has been enabled. The CUDA frontend compiler will generate a deprecation diagnostic for a reference to a deprecated entity from within the body of a __device__, __global__ or __host__ __device__ function when __CUDA_ARCH__ is defined (i.e., during device compilation phase). Other references to deprecated entities will be handled by the host compiler, e.g., a reference from within a __host__ function.
nvcc 在使用 gcc、clang、xlC、icc 或 pgcc 主机编译器时支持使用 deprecated 属性,并在使用 cl.exe 主机编译器时支持使用 deprecated declspec。当启用 C++14 方言时,还支持 [[deprecated]] 标准属性。当定义了 __CUDA_ARCH__ 时(即在设备编译阶段),CUDA 前端编译器会对 __device__、__global__ 或 __host__ __device__ 函数体内对已弃用实体的引用生成弃用诊断。对已弃用实体的其他引用将由主机编译器处理,例如 __host__ 函数内的引用。

The CUDA frontend compiler does not support the #pragma gcc diagnostic or #pragma warning mechanisms supported by various host compilers. Therefore, deprecation diagnostics generated by the CUDA frontend compiler are not affected by these pragmas, but diagnostics generated by the host compiler will be affected. To suppress the warning for device-code, user can use NVIDIA specific pragma #pragma nv_diag_suppress. The nvcc flag -Wno-deprecated-declarations can be used to suppress all deprecation warnings, and the flag -Werror=deprecated-declarations can be used to turn deprecation warnings into errors.
CUDA 前端编译器不支持各种主机编译器支持的 #pragma gcc diagnostic 或 #pragma warning 机制。因此,由 CUDA 前端编译器生成的弃用诊断不受这些编译指示的影响,但主机编译器生成的诊断将受到影响。为了抑制设备代码的警告,用户可以使用 NVIDIA 特定的编译指示 #pragma nv_diag_suppress。nvcc 标志 -Wno-deprecated-declarations 可用于抑制所有弃用警告,标志 -Werror=deprecated-declarations 可用于将弃用警告转换为错误。
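A minimal sketch (hypothetical names, not from the original guide), assuming a host compiler such as gcc or clang that supports the deprecated attribute:
一个最小示例草图(名称为假设,非指南原文),假定主机编译器(如 gcc 或 clang)支持 deprecated 属性:

```cpp
__device__ __attribute__((deprecated)) int old_get(int x) { return x; }

__global__ void kern(int *out) {
  *out = old_get(*out);  // reference from a __device__ context: the CUDA
                         // frontend compiler emits a deprecation diagnostic
                         // during device compilation (__CUDA_ARCH__ defined)
}
```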

14.5.17. Noreturn Annotation
14.5.17. Noreturn 注解 

nvcc supports the use of noreturn attribute when using gcc, clang, xlC, icc or pgcc host compilers, and the use of noreturn declspec when using the cl.exe host compiler. It also supports the [[noreturn]] standard attribute when the C++11 dialect has been enabled.
nvcc 在使用 gcc、clang、xlC、icc 或 pgcc 主机编译器时支持使用 noreturn 属性,并在使用 cl.exe 主机编译器时支持使用 noreturn declspec。启用 C++11 方言后,还支持 [[noreturn]] 标准属性。

The attribute/declspec can be used in both host and device code.
属性/declspec 可以在主机代码和设备代码中使用。
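A minimal sketch using the [[noreturn]] standard attribute (assumes the C++11 dialect is enabled; names are hypothetical):
使用 [[noreturn]] 标准属性的最小示例草图(假定已启用 C++11 方言;名称为假设):

```cpp
#include <cstdlib>

[[noreturn]] __host__ __device__ void fatal(void) {
#ifdef __CUDA_ARCH__
  __trap();      // device path: abort the kernel
  for (;;) { }   // unreachable; keeps the function from returning
#else
  std::abort();  // host path
#endif
}
```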

14.5.18. [[likely]] / [[unlikely]] Standard Attributes
14.5.18. [[likely]] / [[unlikely]] 标准属性 

These attributes are accepted in all configurations that support the C++ standard attribute syntax. The attributes can be used to hint to the device compiler optimizer whether a statement is more or less likely to be executed compared to any alternative path that does not include the statement.
这些属性在支持 C++标准属性语法的所有配置中都被接受。这些属性可用于提示设备编译器优化器,表明与不包括该语句的任何其他路径相比,该语句更可能或不太可能被执行。

Example: 示例:

__device__ int foo(int x) {
  if (x < 10) [[likely]] {   // the 'if' block will likely be entered
    return 4;
  }
  if (x < 20) [[unlikely]] { // the 'if' block will not likely be entered
    return 1;
  }
  return 0;
}

If these attributes are used in host code when __CUDA_ARCH__ is undefined, then they will be present in the code parsed by the host compiler, which may generate a warning if the attributes are not supported. For example, the clang 11 host compiler will generate an ‘unknown attribute’ warning.
如果这些属性在主机代码中使用时 __CUDA_ARCH__ 未定义,则它们将存在于主机编译器解析的代码中,如果这些属性不受支持,可能会生成警告。例如, clang 11 主机编译器将生成“未知属性”警告。

14.5.19. const and pure GNU Attributes
14.5.19. const 和 pure GNU 属性 

These attributes are supported for both host and device functions, when using a language dialect and host compiler that also supports these attributes e.g. with g++ host compiler.
这些属性支持主机和设备函数,当使用支持这些属性的语言方言和主机编译器时,例如使用 g++ 主机编译器。

For a device function annotated with the pure attribute, the device code optimizer assumes that the function does not change any mutable state visible to caller functions (e.g. memory).
对于使用 pure 属性注释的设备函数,设备代码优化器假定该函数不会更改对调用函数可见的任何可变状态(例如内存)。

For a device function annotated with the const attribute, the device code optimizer assumes that the function does not access or change any mutable state visible to caller functions (e.g. memory).
对于使用 const 属性注释的设备函数,设备代码优化器假定该函数不会访问或更改任何对调用函数(例如内存)可见的可变状态。

Example: 示例:

__attribute__((const)) __device__ int get(int in);

__device__ int doit(int in) {
  int sum = 0;

  // because 'get' is marked with the 'const' attribute,
  // the device code optimizer can recognize that the
  // second call to get() can be commoned out.
  sum = get(in);
  sum += get(in);

  return sum;
}

14.5.20. __nv_pure__ Attribute
14.5.20. __nv_pure__ 属性 

The __nv_pure__ attribute is supported for both host and device functions. For host functions, when using a language dialect that supports the pure GNU attribute, the __nv_pure__ attribute is translated to the pure GNU attribute. Similarly, when using MSVC as the host compiler, the attribute is translated to the MSVC noalias attribute.
__nv_pure__ 属性支持主机和设备函数。对于主机函数,在使用支持 pure GNU 属性的语言方言时, __nv_pure__ 属性将被翻译为 pure GNU 属性。同样,当使用 MSVC 作为主机编译器时,该属性将被翻译为 MSVC noalias 属性。

When a device function is annotated with the __nv_pure__ attribute, the device code optimizer assumes that the function does not change any mutable state visible to caller functions (e.g. memory).
当设备函数使用 __nv_pure__ 属性进行注释时,设备代码优化器会假定该函数不会更改对调用函数(例如内存)可见的任何可变状态。

14.5.21. Intel Host Compiler Specific
14.5.21. Intel 主机编译器特定 

The CUDA frontend compiler parser does not recognize some of the intrinsic functions supported by the Intel compiler (e.g. icc). When using the Intel compiler as a host compiler, nvcc will therefore enable the macro __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES during preprocessing. This macro enables explicit declarations of the Intel compiler intrinsic functions in the associated header files, allowing nvcc to support use of such functions in host code19.
CUDA 前端编译器解析器无法识别英特尔编译器支持的某些内部函数(例如 icc )。当将英特尔编译器用作主机编译器时,因此在预处理期间将启用宏 __INTEL_COMPILER_USE_INTRINSIC_PROTOTYPES 。此宏允许在相关头文件中显式声明英特尔编译器内部函数,从而使 nvcc 能够支持在主机代码 19 中使用这些函数。

14.5.22. C++11 Features
14.5.22. C++11 功能 

C++11 features that are enabled by default by the host compiler are also supported by nvcc, subject to the restrictions described in this document. In addition, invoking nvcc with -std=c++11 flag turns on all C++11 features and also invokes the host preprocessor, compiler and linker with the corresponding C++11 dialect option 20.
默认情况下由主机编译器启用的 C++11 功能也受 nvcc 支持,但受本文档中描述的限制约束。此外,使用 -std=c++11 标志调用 nvcc 会打开所有 C++11 功能,并使用相应的 C++11 方言选项 20 调用主机预处理器、编译器和链接器。

14.5.22.1. Lambda Expressions
14.5.22.1. Lambda 表达式 

The execution space specifiers for all member functions21 of the closure class associated with a lambda expression are derived by the compiler as follows. As described in the C++11 standard, the compiler creates a closure type in the smallest block scope, class scope or namespace scope that contains the lambda expression. The innermost function scope enclosing the closure type is computed, and the corresponding function’s execution space specifiers are assigned to the closure class member functions. If there is no enclosing function scope, the execution space specifier is __host__.
与 lambda 表达式相关的闭包类的所有成员函数的执行空间限定符由编译器如下派生。如 C++11 标准所述,编译器在包含 lambda 表达式的最小块作用域、类作用域或命名空间作用域中创建闭包类型。计算封闭闭包类型的最内层函数作用域,并将相应函数的执行空间限定符分配给闭包类成员函数。如果没有封闭函数作用域,则执行空间限定符为 __host__

Examples of lambda expressions and computed execution space specifiers are shown below (in comments).
下面显示了 lambda 表达式和计算执行空间说明符的示例(在注释中)。

auto globalVar = [] { return 0; }; // __host__

void f1(void) {
  auto l1 = [] { return 1; };      // __host__
}

__device__ void f2(void) {
  auto l2 = [] { return 2; };      // __device__
}

__host__ __device__ void f3(void) {
  auto l3 = [] { return 3; };      // __host__ __device__
}

__device__ void f4(int (*fp)() = [] { return 4; } /* __host__ */) {
}

__global__ void f5(void) {
  auto l5 = [] { return 5; };      // __device__
}

__device__ void f6(void) {
  struct S1_t {
    static void helper(int (*fp)() = [] {return 6; } /* __device__ */) {
    }
  };
}

The closure type of a lambda expression cannot be used in the type or non-type argument of a __global__ function template instantiation, unless the lambda is defined within a __device__ or __global__ function.
lambda 表达式的闭包类型不能在 __global__ 函数模板实例化的类型或非类型参数中使用,除非 lambda 表达式是在 __device____global__ 函数内定义的。

Example: 示例:

template <typename T>
__global__ void foo(T in) { };

template <typename T>
struct S1_t { };

void bar(void) {
  auto temp1 = [] { };

  foo<<<1,1>>>(temp1);                    // error: lambda closure type used in
                                          // template type argument
  foo<<<1,1>>>( S1_t<decltype(temp1)>()); // error: lambda closure type used in
                                          // template type argument
}

14.5.22.2. std::initializer_list

By default, the CUDA compiler will implicitly consider the member functions of std::initializer_list to have __host__ __device__ execution space specifiers, and therefore they can be invoked directly from device code. The nvcc flag --no-host-device-initializer-list will disable this behavior; member functions of std::initializer_list will then be considered as __host__ functions and will not be directly invokable from device code.
默认情况下,CUDA 编译器将隐式地认为 std::initializer_list 的成员函数具有 __host__ __device__ 执行空间限定符,因此它们可以直接从设备代码中调用。nvcc 标志 --no-host-device-initializer-list 将禁用此行为;然后 std::initializer_list 的成员函数将被视为 __host__ 函数,并且不能直接从设备代码中调用。

Example: 示例:

#include <initializer_list>

__device__ int foo(std::initializer_list<int> in);

__device__ void bar(void)
  {
    foo({4,5,6});   // (a) initializer list containing only
                    // constant expressions.

    int i = 4;
    foo({i,5,6});   // (b) initializer list with at least one
                    // non-constant element.
                    // This form may have worse performance than (a).
  }

14.5.22.3. Rvalue references
14.5.22.3. 右值引用 

By default, the CUDA compiler will implicitly consider std::move and std::forward function templates to have __host__ __device__ execution space specifiers, and therefore they can be invoked directly from device code. The nvcc flag --no-host-device-move-forward will disable this behavior; std::move and std::forward will then be considered as __host__ functions and will not be directly invokable from device code.
默认情况下,CUDA 编译器将隐式地将 std::movestd::forward 函数模板视为具有 __host__ __device__ 执行空间限定符,因此它们可以直接从设备代码中调用。 nvcc 标志 --no-host-device-move-forward 将禁用此行为;然后 std::movestd::forward 将被视为 __host__ 函数,并且不能直接从设备代码中调用。
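A minimal sketch (hypothetical names, not from the original guide) of calling std::move from device code under the default behavior:
在默认行为下从设备代码调用 std::move 的最小示例草图(名称为假设,非指南原文):

```cpp
#include <utility>

struct Buf { int v; };

__device__ void consume(Buf &&b) { b.v = 0; }

__device__ void demo(void) {
  Buf b{42};
  consume(std::move(b));  // OK by default: std::move is treated as
                          // __host__ __device__; with the nvcc flag
                          // --no-host-device-move-forward this call
                          // would be an error
}
```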

14.5.22.4. Constexpr functions and function templates
14.5.22.4. Constexpr 函数和函数模板 

By default, a constexpr function cannot be called from a function with incompatible execution space 22. The experimental nvcc flag --expt-relaxed-constexpr removes this restriction 23. When this flag is specified, host code can invoke a __device__ constexpr function and device code can invoke a __host__ constexpr function. nvcc will define the macro __CUDACC_RELAXED_CONSTEXPR__ when --expt-relaxed-constexpr has been specified. Note that a function template instantiation may not be a constexpr function even if the corresponding template is marked with the keyword constexpr (C++11 Standard Section [dcl.constexpr.p6]).
默认情况下,constexpr 函数无法从执行空间不兼容的函数中调用 22。实验性 nvcc 标志 --expt-relaxed-constexpr 可以移除此限制 23。当指定了此标志时,主机代码可以调用 __device__ constexpr 函数,设备代码可以调用 __host__ constexpr 函数。当指定了 --expt-relaxed-constexpr 时,nvcc 将定义宏 __CUDACC_RELAXED_CONSTEXPR__ 。请注意,即使相应的模板标记有关键字 constexpr (C++11 标准第 [dcl.constexpr.p6] 节),函数模板实例化可能不是 constexpr 函数。
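A minimal sketch (hypothetical names, not from the original guide) gated on the macro defined by the flag:
基于该标志所定义宏的最小示例草图(名称为假设,非指南原文):

```cpp
constexpr int square(int x) { return x * x; }  // implicitly a __host__ function

__global__ void kern(int *out) {
#ifdef __CUDACC_RELAXED_CONSTEXPR__
  *out = square(7);  // OK only when nvcc is invoked with
                     // --expt-relaxed-constexpr
#else
  *out = 7 * 7;      // fallback when the flag is not specified
#endif
}
```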

14.5.22.5. Constexpr variables
14.5.22.5. Constexpr 变量 

Let ‘V’ denote a namespace scope variable or a class static member variable that has been marked constexpr and that does not have execution space annotations (e.g., __device__, __constant__, __shared__). V is considered to be a host code variable.
让“V”表示一个已标记为 constexpr 且没有执行空间注释(例如 __device__, __constant__, __shared__ )的命名空间范围变量或类静态成员变量。V 被视为主机代码变量。

If V is of scalar type 24 other than long double and the type is not volatile-qualified, the value of V can be directly used in device code. In addition, if V is of a non-scalar type then scalar elements of V can be used inside a constexpr __device__ or __host__ __device__ function, if the call to the function is a constant expression 25. Device source code cannot contain a reference to V or take the address of V.
如果 V 是除 long double 之外的标量类型 24,并且该类型不是 volatile 限定的,则可以直接在设备代码中使用 V 的值。此外,如果 V 是非标量类型,则可以在 constexpr __device____host__ __device__ 函数内使用 V 的标量元素,如果对函数的调用是常量表达式 25。设备源代码不能包含对 V 的引用或获取 V 的地址。

Example: 示例:

constexpr int xxx = 10;
constexpr int yyy = xxx + 4;
struct S1_t { static constexpr int qqq = 100; };

constexpr int host_arr[] = { 1, 2, 3};
constexpr __device__ int get(int idx) { return host_arr[idx]; }

__device__ int foo(int idx) {
  int v1 = xxx + yyy + S1_t::qqq;  // OK
  const int &v2 = xxx;             // error: reference to host constexpr
                                   // variable
  const int *v3 = &xxx;            // error: address of host constexpr
                                   // variable
  const int &v4 = S1_t::qqq;       // error: reference to host constexpr
                                   // variable
  const int *v5 = &S1_t::qqq;      // error: address of host constexpr
                                   // variable

  v1 += get(2);                    // OK: 'get(2)' is a constant
                                   // expression.
  v1 += get(idx);                  // error: 'get(idx)' is not a constant
                                   // expression
  v1 += host_arr[2];               // error: 'host_arr' does not have
                                   // scalar type.
  return v1;
}

14.5.22.6. Inline namespaces
14.5.22.6. 内联命名空间 

For an input CUDA translation unit, the CUDA compiler may invoke the host compiler for compiling the host code within the translation unit. In the code passed to the host compiler, the CUDA compiler will inject additional compiler generated code, if the input CUDA translation unit contained a definition of any of the following entities:
对于输入的 CUDA 翻译单元,CUDA 编译器可能会调用主机编译器来编译翻译单元中的主机代码。在传递给主机编译器的代码中,如果输入的 CUDA 翻译单元包含以下实体的定义,则 CUDA 编译器将注入额外的编译器生成的代码:

  • __global__ function or function template instantiation
    __global__ 函数或函数模板实例化

  • __device__, __constant__

  • variables with surface or texture type
    具有表面或纹理类型的变量

The compiler generated code contains a reference to the defined entity. If the entity is defined within an inline namespace and another entity of the same name and type signature is defined in an enclosing namespace, this reference may be considered ambiguous by the host compiler and host compilation will fail.
编译器生成的代码包含对已定义实体的引用。如果实体在内联命名空间中定义,并且在封闭命名空间中定义了另一个同名且类型签名相同的实体,则主机编译器可能会认为此引用存在歧义,并且主机编译将失败。

This limitation can be avoided by using unique names for such entities defined within an inline namespace.
可以通过为内联命名空间中定义的这些实体使用唯一名称来避免这种限制。

Example: 示例:

__device__ int Gvar;
inline namespace N1 {
  __device__ int Gvar;
}

// <-- CUDA compiler inserts a reference to "Gvar" at this point in the
// translation unit. This reference will be considered ambiguous by the
// host compiler and compilation will fail.

Example: 示例:

inline namespace N1 {
  namespace N2 {
    __device__ int Gvar;
  }
}

namespace N2 {
  __device__ int Gvar;
}

// <-- CUDA compiler inserts reference to "::N2::Gvar" at this point in
// the translation unit. This reference will be considered ambiguous by
// the host compiler and compilation will fail.
14.5.22.6.1. Inline unnamed namespaces
14.5.22.6.1. 内联未命名命名空间 

The following entities cannot be declared in namespace scope within an inline unnamed namespace:
无法在内联未命名命名空间中的命名空间范围内声明以下实体:

  • __managed__, __device__, __shared__ and __constant__ variables
    __managed____device____shared____constant__ 变量

  • __global__ function and function templates
    __global__ 函数和函数模板

  • variables with surface or texture type
    具有表面或纹理类型的变量

Example: 示例:

inline namespace {
  namespace N2 {
    template <typename T>
    __global__ void foo(void);            // error

    __global__ void bar(void) { }         // error

    template <>
    __global__ void foo<int>(void) { }    // error

    __device__ int x1b;                   // error
    __constant__ int x2b;                 // error
    __shared__ int x3b;                   // error

    texture<int> q2;                      // error
    surface<int> s2;                      // error
  }
};

14.5.22.7. thread_local

The thread_local storage specifier is not allowed in device code.
设备代码中不允许使用 thread_local 存储说明符。
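A minimal sketch of this restriction (hypothetical names, not from the original guide):
此限制的最小示例草图(名称为假设,非指南原文):

```cpp
thread_local int host_counter;  // OK: host code

__global__ void kern(void) {
  // thread_local int t;   // error: thread_local storage specifier
                           //        is not allowed in device code
  int per_thread = 0;      // ordinary local variables are already
                           // private to each device thread
  (void)per_thread;
}
```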

14.5.22.8. __global__ functions and function templates
14.5.22.8. __global__ 函数和函数模板 

If the closure type associated with a lambda expression is used in a template argument of a __global__ function template instantiation, the lambda expression must either be defined in the immediate or nested block scope of a __device__ or __global__ function, or must be an extended lambda.
如果与 lambda 表达式相关联的闭包类型在 __global__ 函数模板实例化的模板参数中使用,则 lambda 表达式必须在 __device____global__ 函数的直接或嵌套块作用域中定义,或者必须是扩展 lambda。

Example: 示例:

template <typename T>
__global__ void kernel(T in) { }

__device__ void foo_device(void)
{
  // All kernel instantiations in this function
  // are valid, since the lambdas are defined inside
  // a __device__ function.

  kernel<<<1,1>>>( [] __device__ { } );
  kernel<<<1,1>>>( [] __host__ __device__ { } );
  kernel<<<1,1>>>( []  { } );
}

auto lam1 = [] { };

auto lam2 = [] __host__ __device__ { };

void foo_host(void)
{
   // OK: instantiated with closure type of an extended __device__ lambda
   kernel<<<1,1>>>( [] __device__ { } );

   // OK: instantiated with closure type of an extended __host__ __device__
   // lambda
   kernel<<<1,1>>>( [] __host__ __device__ { } );

   // error: unsupported: instantiated with closure type of a lambda
   // that is not an extended lambda
   kernel<<<1,1>>>( []  { } );

   // error: unsupported: instantiated with closure type of a lambda
   // that is not an extended lambda
   kernel<<<1,1>>>( lam1);

   // error: unsupported: instantiated with closure type of a lambda
   // that is not an extended lambda
   kernel<<<1,1>>>( lam2);
}

A __global__ function or function template cannot be declared as constexpr.
无法将 __global__ 函数或函数模板声明为 constexpr

A __global__ function or function template cannot have a parameter of type std::initializer_list or va_list.
一个 __global__ 函数或函数模板不能有类型为 std::initializer_listva_list 的参数。

A __global__ function cannot have a parameter of rvalue reference type.
一个 __global__ 函数不能具有右值引用类型的参数。
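The three restrictions above can be sketched as follows (hypothetical names, not from the original guide):
上述三项限制可用如下草图说明(名称为假设,非指南原文):

```cpp
#include <initializer_list>
#include <cstdarg>

// constexpr __global__ void k0(void) { }           // error: constexpr kernel
// __global__ void k1(std::initializer_list<int>);  // error: initializer_list
//                                                  //        parameter
// __global__ void k2(va_list args);                // error: va_list parameter
// __global__ void k3(int &&v);                     // error: rvalue reference
//                                                  //        parameter

__global__ void k4(int v) { }          // OK: by-value parameter
__global__ void k5(const int *p) { }   // OK: pointer parameter
```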

A variadic __global__ function template has the following restrictions:
可变参数 __global__ 函数模板有以下限制:

  • Only a single pack parameter is allowed.
    只允许一个 pack 参数。

  • The pack parameter must be listed last in the template parameter list.
    模板参数列表中必须将 pack 参数列在最后。

Example: 示例:

// ok
template <template <typename...> class Wrapper, typename... Pack>
__global__ void foo1(Wrapper<Pack...>);

// error: pack parameter is not last in parameter list
template <typename... Pack, template <typename...> class Wrapper>
__global__ void foo2(Wrapper<Pack...>);

// error: multiple parameter packs
template <typename... Pack1, int...Pack2, template<typename...> class Wrapper1,
          template<int...> class Wrapper2>
__global__ void foo3(Wrapper1<Pack1...>, Wrapper2<Pack2...>);

14.5.22.9. __managed__ and __shared__ variables
14.5.22.9. __managed__ 和 __shared__ 变量 

__managed__ and __shared__ variables cannot be marked with the keyword constexpr.
__managed__ 和 __shared__ 变量不能使用关键字 constexpr 标记。

14.5.22.10. Defaulted functions
14.5.22.10. 默认函数 

Execution space specifiers on a function that is explicitly-defaulted on its first declaration are ignored by the CUDA compiler. Instead, the CUDA compiler will infer the execution space specifiers as described in Implicitly-declared and explicitly-defaulted functions.
CUDA 编译器会忽略在首次声明上显式默认的函数上的执行空间限定符。相反,CUDA 编译器将推断执行空间限定符,如隐式声明和显式默认函数中所述。

Execution space specifiers are not ignored if the function is explicitly-defaulted, but not on its first declaration.
如果函数被显式默认,但不是在其第一次声明时,执行空间修饰符不会被忽略。

Example: 示例:

struct S1 {
  // warning: __host__ annotation is ignored on a function that
  //          is explicitly-defaulted on its first declaration
  __host__ S1() = default;
};

__device__ void foo1() {
  //note: __device__ execution space is derived for S1::S1
  //       based on implicit call from within __device__ function
  //       foo1
  S1 s1;
}

struct S2 {
  __host__ S2();
};

//note: S2::S2 is not defaulted on its first declaration, and
//      its execution space is fixed to __host__  based on its
//      first declaration.
S2::S2() = default;

__device__ void foo2() {
   // error: call from __device__ function 'foo2' to
   //        __host__ function 'S2::S2'
   S2 s2;
}

14.5.23. C++14 Features
14.5.23. C++14 功能 

C++14 features enabled by default by the host compiler are also supported by nvcc. Passing nvcc -std=c++14 flag turns on all C++14 features and also invokes the host preprocessor, compiler and linker with the corresponding C++14 dialect option 26. This section describes the restrictions on the supported C++14 features.
默认情况下由主机编译器启用的 C++14 功能也受 nvcc 支持。传递 nvcc -std=c++14 标志会打开所有 C++14 功能,并使用相应的 C++14 方言选项 26 调用主机预处理器、编译器和链接器。本节描述了受支持的 C++14 功能的限制。

14.5.23.1. Functions with deduced return type
14.5.23.1. 具有推导返回类型的函数 

A __global__ function cannot have a deduced return type.
一个 __global__ 函数不能有一个推断的返回类型。

If a __device__ function has deduced return type, the CUDA frontend compiler will change the function declaration to have a void return type, before invoking the host compiler. This may cause issues for introspecting the deduced return type of the __device__ function in host code. Thus, the CUDA compiler will issue compile-time errors for referencing such deduced return type outside device function bodies, except if the reference is absent when __CUDA_ARCH__ is undefined.

Examples:

__device__ auto fn1(int x) {
  return x;
}

__device__ decltype(auto) fn2(int x) {
  return x;
}

__device__ void device_fn1() {
  // OK
  int (*p1)(int) = fn1;
}

// error: referenced outside device function bodies
decltype(fn1(10)) g1;

void host_fn1() {
  // error: referenced outside device function bodies
  int (*p1)(int) = fn1;

  struct S_local_t {
    // error: referenced outside device function bodies
    decltype(fn2(10)) m1;

    S_local_t() : m1(10) { }
  };
}

// error: referenced outside device function bodies
template <typename T = decltype(fn2)>
void host_fn2() { }

template<typename T> struct S1_t { };

// error: referenced outside device function bodies
struct S1_derived_t : S1_t<decltype(fn1)> { };

14.5.23.2. Variable templates

A __device__/__constant__ variable template cannot have a const qualified type when using the Microsoft host compiler.

Examples:

// error: a __device__ variable template cannot
// have a const qualified type on Windows
template <typename T>
__device__ const T d1(2);

int *const x = nullptr;
// error: a __device__ variable template cannot
// have a const qualified type on Windows
template <typename T>
__device__ T *const d2(x);

// OK
template <typename T>
__device__ const T *d3;

__device__ void fn() {
  int t1 = d1<int>;

  int *const t2 = d2<int>;

  const int *t3 = d3<int>;
}

14.5.24. C++17 Features

C++17 features enabled by default by the host compiler are also supported by nvcc. Passing nvcc -std=c++17 flag turns on all C++17 features and also invokes the host preprocessor, compiler and linker with the corresponding C++17 dialect option 27. This section describes the restrictions on the supported C++17 features.

14.5.24.1. Inline Variable

  • A namespace scope inline variable declared with __device__ or __constant__ or __managed__ memory space specifier must have internal linkage, if the code is compiled with nvcc in whole program compilation mode.

    Examples:

    inline __device__ int xxx; //error when compiled with nvcc in
                               //whole program compilation mode.
                               //ok when compiled with nvcc in
                               //separate compilation mode.
    
    inline __shared__ int yyy0; // ok.
    
    static inline __device__ int yyy; // ok: internal linkage
    namespace {
    inline __device__ int zzz; // ok: internal linkage
    }
    
  • When using g++ host compiler, an inline variable declared with __managed__ memory space specifier may not be visible to the debugger.

14.5.24.2. Structured Binding

A structured binding cannot be declared with a variable memory space specifier.

Example:

struct S { int x; int y; };
__device__ auto [a1, b1] = S{4,5}; // error

14.5.25. C++20 Features

C++20 features enabled by default by the host compiler are also supported by nvcc. Passing nvcc -std=c++20 flag turns on all C++20 features and also invokes the host preprocessor, compiler and linker with the corresponding C++20 dialect option 28. This section describes the restrictions on the supported C++20 features.

14.5.25.1. Module support

Modules are not supported in CUDA C++, in either host or device code. Uses of the module, export and import keywords are diagnosed as errors.
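For example (the module name here is hypothetical):

export module utils;   // error: module declarations are not supported by nvcc
import utils;          // error: the import keyword is diagnosed as an error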

14.5.25.2. Coroutine support

Coroutines are not supported in device code. Uses of the co_await, co_yield and co_return keywords in the scope of a device function are diagnosed as errors during device compilation.
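A minimal sketch of the restriction (a complete coroutine would also require a promise type, omitted here):

__device__ void dfunc() {
  co_return;  // error: co_return is diagnosed during device compilation
}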

14.5.25.3. Three-way comparison operator

The three-way comparison operator is supported in both host and device code, but some uses implicitly rely on functionality from the Standard Template Library provided by the host implementation. Uses of those operators may therefore require the --expt-relaxed-constexpr flag to silence warnings, and they additionally require that the host implementation of that functionality satisfies the requirements of device code.

Example:

#include <compare>
struct S {
  int x, y, z;
  auto operator<=>(const S& rhs) const = default;
  __host__ __device__ bool operator<=>(int rhs) const { return false; }
};
__host__ __device__ bool f(S a, S b) {
  if (a <=> 1) // ok, calls a user-defined host-device overload
    return true;
  return a < b; // call to an implicitly-declared function and requires
                // a device-compatible std::strong_ordering implementation
}

14.5.25.4. Consteval functions

Ordinarily, cross execution space calls are not allowed, and cause a compiler diagnostic (warning or error). This restriction does not apply when the called function is declared with the consteval specifier. Thus, a __device__ or __global__ function can call a __host__ consteval function, and a __host__ function can call a __device__ consteval function.

Example:

namespace N1 {
//consteval host function
consteval int hcallee() { return 10; }

__device__ int dfunc() { return hcallee(); /* OK */ }
__global__ void gfunc() { (void)hcallee(); /* OK */ }
__host__ __device__ int hdfunc() { return hcallee();  /* OK */ }
int hfunc() { return hcallee(); /* OK */ }
} // namespace N1


namespace N2 {
//consteval device function
consteval __device__ int dcallee() { return 10; }

__device__ int dfunc() { return dcallee(); /* OK */ }
__global__ void gfunc() { (void)dcallee(); /* OK */ }
__host__ __device__ int hdfunc() { return dcallee();  /* OK */ }
int hfunc() { return dcallee(); /* OK */ }
}

14.6. Polymorphic Function Wrappers

A polymorphic function wrapper class template nvstd::function is provided in the nvfunctional header. Instances of this class template can be used to store, copy and invoke any callable target, e.g., lambda expressions. nvstd::function can be used in both host and device code.

Example:

#include <nvfunctional>

__device__ int foo_d() { return 1; }
__host__ __device__ int foo_hd () { return 2; }
__host__ int foo_h() { return 3; }

__global__ void kernel(int *result) {
  nvstd::function<int()> fn1 = foo_d;
  nvstd::function<int()> fn2 = foo_hd;
  nvstd::function<int()> fn3 =  []() { return 10; };

  *result = fn1() + fn2() + fn3();
}

__host__ __device__ void hostdevice_func(int *result) {
  nvstd::function<int()> fn1 = foo_hd;
  nvstd::function<int()> fn2 =  []() { return 10; };

  *result = fn1() + fn2();
}

__host__ void host_func(int *result) {
  nvstd::function<int()> fn1 = foo_h;
  nvstd::function<int()> fn2 = foo_hd;
  nvstd::function<int()> fn3 =  []() { return 10; };

  *result = fn1() + fn2() + fn3();
}

Instances of nvstd::function in host code cannot be initialized with the address of a __device__ function or with a functor whose operator() is a __device__ function. Instances of nvstd::function in device code cannot be initialized with the address of a __host__ function or with a functor whose operator() is a __host__ function.

nvstd::function instances cannot be passed from host code to device code (and vice versa) at run time. nvstd::function cannot be used in the parameter type of a __global__ function, if the __global__ function is launched from host code.

Example:

#include <nvfunctional>

__device__ int foo_d() { return 1; }
__host__ int foo_h() { return 3; }
auto lam_h = [] { return 0; };

__global__ void k(void) {
  // error: initialized with address of __host__ function
  nvstd::function<int()> fn1 = foo_h;

  // error: initialized with address of functor with
  // __host__ operator() function
  nvstd::function<int()> fn2 = lam_h;
}

__global__ void kern(nvstd::function<int()> f1) { }

void foo(void) {
  // error: initialized with address of __device__ function
  nvstd::function<int()> fn1 = foo_d;

  auto lam_d = [=] __device__ { return 1; };

  // error: initialized with address of functor with
  // __device__ operator() function
  nvstd::function<int()> fn2 = lam_d;

  // error: passing nvstd::function from host to device
  kern<<<1,1>>>(fn2);
}

nvstd::function is defined in the nvfunctional header as follows:
nvstd::functionnvfunctional 头文件中定义如下:

namespace nvstd {
  template <class _RetType, class ..._ArgTypes>
  class function<_RetType(_ArgTypes...)>
  {
    public:
      // constructors
      __device__ __host__  function() noexcept;
      __device__ __host__  function(nullptr_t) noexcept;
      __device__ __host__  function(const function &);
      __device__ __host__  function(function &&);

      template<class _F>
      __device__ __host__  function(_F);

      // destructor
      __device__ __host__  ~function();

      // assignment operators
      __device__ __host__  function& operator=(const function&);
      __device__ __host__  function& operator=(function&&);
      __device__ __host__  function& operator=(nullptr_t);
      template<class _F>
      __device__ __host__  function& operator=(_F&&);

      // swap
      __device__ __host__  void swap(function&) noexcept;

      // function capacity
      __device__ __host__  explicit operator bool() const noexcept;

      // function invocation
      __device__ _RetType operator()(_ArgTypes...) const;
  };

  // null pointer comparisons
  template <class _R, class... _ArgTypes>
  __device__ __host__
  bool operator==(const function<_R(_ArgTypes...)>&, nullptr_t) noexcept;

  template <class _R, class... _ArgTypes>
  __device__ __host__
  bool operator==(nullptr_t, const function<_R(_ArgTypes...)>&) noexcept;

  template <class _R, class... _ArgTypes>
  __device__ __host__
  bool operator!=(const function<_R(_ArgTypes...)>&, nullptr_t) noexcept;

  template <class _R, class... _ArgTypes>
  __device__ __host__
  bool operator!=(nullptr_t, const function<_R(_ArgTypes...)>&) noexcept;

  // specialized algorithms
  template <class _R, class... _ArgTypes>
  __device__ __host__
  void swap(function<_R(_ArgTypes...)>&, function<_R(_ArgTypes...)>&);
}

14.7. Extended Lambdas

The nvcc flag '--extended-lambda' allows explicit execution space annotations in a lambda expression 29. The execution space annotations should be present after the ‘lambda-introducer’ and before the optional ‘lambda-declarator’. nvcc will define the macro __CUDACC_EXTENDED_LAMBDA__ when the '--extended-lambda' flag has been specified.

An ‘extended __device__ lambda’ is a lambda expression that is annotated explicitly with ‘__device__’, and is defined within the immediate or nested block scope of a __host__ or __host__ __device__ function.

An ‘extended __host__ __device__ lambda’ is a lambda expression that is annotated explicitly with both ‘__host__’ and ‘__device__’, and is defined within the immediate or nested block scope of a __host__ or __host__ __device__ function.

An ‘extended lambda’ denotes either an extended __device__ lambda or an extended __host__ __device__ lambda. Extended lambdas can be used in the type arguments of __global__ function template instantiation.

If the execution space annotations are not explicitly specified, they are computed based on the scopes enclosing the closure class associated with the lambda, as described in the section on C++11 support. The execution space annotations are applied to all methods of the closure class associated with the lambda.

Example:

void foo_host(void) {
  // not an extended lambda: no explicit execution space annotations
  auto lam1 = [] { };

  // extended __device__ lambda
  auto lam2 = [] __device__ { };

  // extended __host__ __device__ lambda
  auto lam3 = [] __host__ __device__ { };

  // not an extended lambda: explicitly annotated with only '__host__'
  auto lam4 = [] __host__ { };
}

__host__ __device__ void foo_host_device(void) {
  // not an extended lambda: no explicit execution space annotations
  auto lam1 = [] { };

  // extended __device__ lambda
  auto lam2 = [] __device__ { };

  // extended __host__ __device__ lambda
  auto lam3 = [] __host__ __device__ { };

  // not an extended lambda: explicitly annotated with only '__host__'
  auto lam4 = [] __host__ { };
}

__device__ void foo_device(void) {
  // none of the lambdas within this function are extended lambdas,
  // because the enclosing function is not a __host__ or __host__ __device__
  // function.
  auto lam1 = [] { };
  auto lam2 = [] __device__ { };
  auto lam3 = [] __host__ __device__ { };
  auto lam4 = [] __host__ { };
}

// lam1 and lam2 are not extended lambdas because they are not defined
// within a __host__ or __host__ __device__ function.
auto lam1 = [] { };
auto lam2 = [] __host__ __device__ { };

14.7.1. Extended Lambda Type Traits

The compiler provides type traits to detect closure types for extended lambdas at compile time:

__nv_is_extended_device_lambda_closure_type(type): If ‘type’ is the closure class created for an extended __device__ lambda, then the trait is true, otherwise it is false.

__nv_is_extended_device_lambda_with_preserved_return_type(type): If ‘type’ is the closure class created for an extended __device__ lambda and the lambda is defined with a trailing return type (subject to the restriction below), then the trait is true, otherwise it is false. If the trailing return type refers to any lambda parameter name, the return type is not preserved.

__nv_is_extended_host_device_lambda_closure_type(type): If ‘type’ is the closure class created for an extended __host__ __device__ lambda, then the trait is true, otherwise it is false.

These traits can be used in all compilation modes, irrespective of whether lambdas or extended lambdas are enabled30.

Example:

#define IS_D_LAMBDA(X) __nv_is_extended_device_lambda_closure_type(X)
#define IS_DPRT_LAMBDA(X) __nv_is_extended_device_lambda_with_preserved_return_type(X)
#define IS_HD_LAMBDA(X) __nv_is_extended_host_device_lambda_closure_type(X)

auto lam0 = [] __host__ __device__ { };

void foo(void) {
  auto lam1 = [] { };
  auto lam2 = [] __device__ { };
  auto lam3 = [] __host__ __device__ { };
  auto lam4 = [] __device__ () -> double { return 3.14; };
  auto lam5 = [] __device__ (int x) -> decltype(&x) { return 0; };

  // lam0 is not an extended lambda (since defined outside function scope)
  static_assert(!IS_D_LAMBDA(decltype(lam0)), "");
  static_assert(!IS_DPRT_LAMBDA(decltype(lam0)), "");
  static_assert(!IS_HD_LAMBDA(decltype(lam0)), "");

  // lam1 is not an extended lambda (since no execution space annotations)
  static_assert(!IS_D_LAMBDA(decltype(lam1)), "");
  static_assert(!IS_DPRT_LAMBDA(decltype(lam1)), "");
  static_assert(!IS_HD_LAMBDA(decltype(lam1)), "");

  // lam2 is an extended __device__ lambda
  static_assert(IS_D_LAMBDA(decltype(lam2)), "");
  static_assert(!IS_DPRT_LAMBDA(decltype(lam2)), "");
  static_assert(!IS_HD_LAMBDA(decltype(lam2)), "");

  // lam3 is an extended __host__ __device__ lambda
  static_assert(!IS_D_LAMBDA(decltype(lam3)), "");
  static_assert(!IS_DPRT_LAMBDA(decltype(lam3)), "");
  static_assert(IS_HD_LAMBDA(decltype(lam3)), "");

  // lam4 is an extended __device__ lambda with preserved return type
  static_assert(IS_D_LAMBDA(decltype(lam4)), "");
  static_assert(IS_DPRT_LAMBDA(decltype(lam4)), "");
  static_assert(!IS_HD_LAMBDA(decltype(lam4)), "");

  // lam5 is not an extended __device__ lambda with preserved return type
  // because it references the operator()'s parameter types in the trailing return type.
  static_assert(IS_D_LAMBDA(decltype(lam5)), "");
  static_assert(!IS_DPRT_LAMBDA(decltype(lam5)), "");
  static_assert(!IS_HD_LAMBDA(decltype(lam5)), "");
}

14.7.2. Extended Lambda Restrictions

The CUDA compiler will replace an extended lambda expression with an instance of a placeholder type defined in namespace scope, before invoking the host compiler. The template argument of the placeholder type requires taking the address of a function enclosing the original extended lambda expression. This is required for the correct execution of any __global__ function template whose template argument involves the closure type of an extended lambda. The enclosing function is computed as follows.

By definition, the extended lambda is present within the immediate or nested block scope of a __host__ or __host__ __device__ function. If this function is not the operator() of a lambda expression, then it is considered the enclosing function for the extended lambda. Otherwise, the extended lambda is defined within the immediate or nested block scope of the operator() of one or more enclosing lambda expressions. If the outermost such lambda expression is defined in the immediate or nested block scope of a function F, then F is the computed enclosing function, else the enclosing function does not exist.

Example:

void foo(void) {
  // enclosing function for lam1 is "foo"
  auto lam1 = [] __device__ { };

  auto lam2 = [] {
     auto lam3 = [] {
        // enclosing function for lam4 is "foo"
        auto lam4 = [] __host__ __device__ { };
     };
  };
}

auto lam6 = [] {
  // enclosing function for lam7 does not exist
  auto lam7 = [] __host__ __device__ { };
};

Here are the restrictions on extended lambdas:

  1. An extended lambda cannot be defined inside another extended lambda expression.

    Example:

    void foo(void) {
      auto lam1 = [] __host__ __device__  {
        // error: extended lambda defined within another extended lambda
        auto lam2 = [] __host__ __device__ { };
      };
    }
    
  2. An extended lambda cannot be defined inside a generic lambda expression.

    Example:

    void foo(void) {
      auto lam1 = [] (auto) {
        // error: extended lambda defined within a generic lambda
        auto lam2 = [] __host__ __device__ { };
      };
    }
    
  3. If an extended lambda is defined within the immediate or nested block scope of one or more nested lambda expression, the outermost such lambda expression must be defined inside the immediate or nested block scope of a function.

    Example:

    auto lam1 = []  {
      // error: outer enclosing lambda is not defined within a
      // non-lambda-operator() function.
      auto lam2 = [] __host__ __device__ { };
    };
    
  4. The enclosing function for the extended lambda must be named, and it must be possible to take its address. If the enclosing function is a class member, then the following conditions must be satisfied:

    • All classes enclosing the member function must have a name.

    • The member function must not have private or protected access within its parent class.

    • All enclosing classes must not have private or protected access within their respective parent classes.

    Example:

    void foo(void) {
      // OK
      auto lam1 = [] __device__ { return 0; };
      {
        // OK
        auto lam2 = [] __device__ { return 0; };
        // OK
        auto lam3 = [] __device__ __host__ { return 0; };
      }
    }
    
    struct S1_t {
      S1_t(void) {
        // Error: cannot take address of enclosing function
        auto lam4 = [] __device__ { return 0; };
      }
    };
    
    class C0_t {
      void foo(void) {
        // Error: enclosing function has private access in parent class
        auto temp1 = [] __device__ { return 10; };
      }
      struct S2_t {
        void foo(void) {
          // Error: enclosing class S2_t has private access in its
          // parent class
          auto temp1 = [] __device__ { return 10; };
        }
      };
    };
    
  5. It must be possible to take the address of the enclosing routine unambiguously at the point where the extended lambda is defined. This may not be feasible in some cases, e.g., when a class typedef shadows a template type argument of the same name.

    Example:

    template <typename> struct A {
      typedef void Bar;
      void test();
    };
    
    template<> struct A<void> { };
    
    template <typename Bar>
    void A<Bar>::test() {
      /* In code sent to host compiler, nvcc will inject an
         address expression here, of the form:
         (void (A<Bar>::*)(void))&A::test
    
         However, the class typedef 'Bar' (to void) shadows the
         template argument 'Bar', causing the address
         expression in A<int>::test to actually refer to:
         (void (A<void>::*)(void))&A::test
    
         ..which doesn't take the address of the enclosing
         routine 'A<int>::test' correctly.
      */
      auto lam1 = [] __host__ __device__ { return 4; };
    }
    
    int main() {
      A<int> xxx;
      xxx.test();
    }
    
  6. An extended lambda cannot be defined in a class that is local to a function.

    Example:

    void foo(void) {
      struct S1_t {
        void bar(void) {
          // Error: bar is member of a class that is local to a function.
          auto lam4 = [] __host__ __device__ { return 0; };
        }
      };
    }
    
  7. The enclosing function for an extended lambda cannot have deduced return type.

    Example:

    auto foo(void) {
      // Error: the return type of foo is deduced.
      auto lam1 = [] __host__ __device__ { return 0; };
    }
    
  8. __host__ __device__ extended lambdas cannot be generic lambdas.

    Example:

    void foo(void) {
      // Error: __host__ __device__ extended lambdas cannot be
      // generic lambdas.
      auto lam1 = [] __host__ __device__ (auto i) { return i; };
    
      // Error: __host__ __device__ extended lambdas cannot be
      // generic lambdas.
      auto lam2 = [] __host__ __device__ (auto ...i) {
                   return sizeof...(i);
                  };
    }
    
  9. If the enclosing function is an instantiation of a function template or a member function template, and/or the function is a member of a class template, the template(s) must satisfy the following constraints:

    • The template must have at most one variadic parameter, and it must be listed last in the template parameter list.

    • The template parameters must be named.

    • The template instantiation argument types cannot involve types that are either local to a function (except for closure types for extended lambdas), or are private or protected class members.

    Example:

    template <typename T>
    __global__ void kern(T in) { in(); }
    
    template <typename... T>
    struct foo {};
    
    template < template <typename...> class T, typename... P1,
              typename... P2>
    void bar1(const T<P1...>, const T<P2...>) {
      // Error: enclosing function has multiple parameter packs
      auto lam1 =  [] __device__ { return 10; };
    }
    
    template < template <typename...> class T, typename... P1,
              typename T2>
    void bar2(const T<P1...>, T2) {
      // Error: for enclosing function, the
      // parameter pack is not last in the template parameter list.
      auto lam1 =  [] __device__ { return 10; };
    }
    
    template <typename T, T>
    void bar3(void) {
      // Error: for enclosing function, the second template
      // parameter is not named.
      auto lam1 =  [] __device__ { return 10; };
    }
    
    int main() {
      foo<char, int, float> f1;
      foo<char, int> f2;
      bar1(f1, f2);
      bar2(f1, 10);
      bar3<int, 10>();
    }
    

    Example:

    template <typename T>
    __global__ void kern(T in) { in(); }
    
    template <typename T>
    void bar4(void) {
      auto lam1 =  [] __device__ { return 10; };
      kern<<<1,1>>>(lam1);
    }
    
    struct C1_t { struct S1_t { }; friend int main(void); };
    int main() {
      struct S1_t { };
      // Error: enclosing function for device lambda in bar4
      // is instantiated with a type local to main.
      bar4<S1_t>();
    
      // Error: enclosing function for device lambda in bar4
      // is instantiated with a type that is a private member
      // of a class.
      bar4<C1_t::S1_t>();
    }
    
  10. With Visual Studio host compilers, the enclosing function must have external linkage. The restriction is present because this host compiler does not support using the address of non-extern linkage functions as template arguments, which is needed by the CUDA compiler transformations to support extended lambdas.
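    Example (with a Visual Studio host compiler; the function names are illustrative):

    // error: enclosing function 'foo' has internal linkage
    static void foo(void) {
      auto lam1 = [] __device__ { return 0; };
    }

    // OK: enclosing function 'bar' has external linkage
    void bar(void) {
      auto lam2 = [] __device__ { return 0; };
    }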

  11. With Visual Studio host compilers, an extended lambda shall not be defined within the body of an ‘if-constexpr’ block.
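    Example (with a Visual Studio host compiler; the function name is illustrative):

    template <bool B>
    void foo(void) {
      if constexpr (B) {
        // error: extended lambda defined within an 'if-constexpr' block
        auto lam1 = [] __device__ { return 0; };
      }
    }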

  12. An extended lambda has the following restrictions on captured variables:

    • In the code sent to the host compiler, the variable may be passed by value to a sequence of helper functions before being used to direct-initialize the field of the class type used to represent the closure type for the extended lambda31.

    • A variable can only be captured by value.

    • A variable of array type cannot be captured if the number of array dimensions is greater than 7.

    • For a variable of array type, in the code sent to the host compiler, the closure type’s array field is first default-initialized, and then each element of the array field is copy-assigned from the corresponding element of the captured array variable. Therefore, the array element type must be default-constructible and copy-assignable in host code.

    • A function parameter that is an element of a variadic argument pack cannot be captured.

    • The type of the captured variable cannot involve types that are either local to a function (except for closure types of extended lambdas), or are private or protected class members.

    • For a __host__ __device__ extended lambda, the types used in the return or parameter types of the lambda expression’s operator() cannot involve types that are either local to a function (except for closure types of extended lambdas), or are private or protected class members.

    • Init-capture is not supported for __host__ __device__ extended lambdas. Init-capture is supported for __device__ extended lambdas, except when the init-capture is of array type or of type std::initializer_list.

    • The function call operator for an extended lambda is not constexpr. The closure type for an extended lambda is not a literal type. The constexpr and consteval specifiers cannot be used in the declaration of an extended lambda.

    • A variable cannot be implicitly captured inside an if-constexpr block lexically nested inside an extended lambda, unless it has already been implicitly captured earlier outside the if-constexpr block or appears in the explicit capture list for the extended lambda (see example below).

    Example

    void foo(void) {
      // OK: an init-capture is allowed for an
      // extended __device__ lambda.
      auto lam1 = [x = 1] __device__ () { return x; };
    
      // Error: an init-capture is not allowed for
      // an extended __host__ __device__ lambda.
      auto lam2 = [x = 1] __host__ __device__ () { return x; };
    
      int a = 1;
      // Error: an extended __device__ lambda cannot capture
      // variables by reference.
      auto lam3 = [&a] __device__ () { return a; };
    
      // Error: by-reference capture is not allowed
      // for an extended __device__ lambda.
      auto lam4 = [&x = a] __device__ () { return x; };
    
      struct S1_t { };
      S1_t s1;
      // Error: a type local to a function cannot be used in the type
      // of a captured variable.
      auto lam6 = [s1] __device__ () { };
    
      // Error: an init-capture cannot be of type std::initializer_list.
      auto lam7 = [x = {11}] __device__ () { };
    
      std::initializer_list<int> b = {11,22,33};
      // Error: an init-capture cannot be of type std::initializer_list.
      auto lam8 = [x = b] __device__ () { };
    
      // Error scenario (lam9) and supported scenarios (lam10, lam11)
      // for capture within 'if-constexpr' block
      int yyy = 4;
      auto lam9 = [=] __device__ {
        int result = 0;
        if constexpr(false) {
          //Error: An extended __device__ lambda cannot first-capture
          //      'yyy' in constexpr-if context
          result += yyy;
        }
        return result;
      };
    
      auto lam10 = [yyy] __device__ {
        int result = 0;
        if constexpr(false) {
          //OK: 'yyy' already listed in explicit capture list for the extended lambda
          result += yyy;
        }
        return result;
      };
    
      auto lam11 = [=] __device__ {
        int result = yyy;
        if constexpr(false) {
          //OK: 'yyy' already implicit captured outside the 'if-constexpr' block
          result += yyy;
        }
        return result;
      };
    }
    
  13. When parsing a function, the CUDA compiler assigns a counter value to each extended lambda within that function. This counter value is used in the substituted named type passed to the host compiler. Hence, whether or not an extended lambda is defined within a function should not depend on a particular value of __CUDA_ARCH__, or on __CUDA_ARCH__ being undefined.

    Example

    template <typename T>
    __global__ void kernel(T in) { in(); }
    
    __host__ __device__ void foo(void) {
      // Error: the number and relative declaration
      // order of extended lambdas depends on
      // __CUDA_ARCH__
    #if defined(__CUDA_ARCH__)
      auto lam1 = [] __device__ { return 0; };
      auto lam1b = [] __host__ __device__ { return 10; };
    #endif
      auto lam2 = [] __device__ { return 4; };
      kernel<<<1,1>>>(lam2);
    }
    
  14. As described above, the CUDA compiler replaces a __device__ extended lambda defined in a host function with a placeholder type defined in namespace scope. Unless the trait __nv_is_extended_device_lambda_with_preserved_return_type() returns true for the closure type of the extended lambda, the placeholder type does not define an operator() function equivalent to the original lambda declaration. An attempt to determine the return type or parameter types of the operator() function of such a lambda may therefore work incorrectly in host code, as the code processed by the host compiler will be semantically different from the input code processed by the CUDA compiler. However, it is OK to introspect the return type or parameter types of the operator() function within device code. Note that this restriction does not apply to __host__ __device__ extended lambdas, or to __device__ extended lambdas for which the trait __nv_is_extended_device_lambda_with_preserved_return_type() returns true.

    Example

    #include <type_traits>
    const char& getRef(const char* p) { return *p; }
    
    void foo(void) {
      auto lam1 = [] __device__ { return "10"; };
    
      // Error: attempt to extract the return type
      // of a __device__ lambda in host code
      std::result_of<decltype(lam1)()>::type xx1 = "abc";
    
    
      auto lam2 = [] __host__ __device__  { return "10"; };
    
      // OK : lam2 represents a __host__ __device__ extended lambda
      std::result_of<decltype(lam2)()>::type xx2 = "abc";
    
      auto lam3 = []  __device__ () -> const char * { return "10"; };
    
      // OK : lam3 represents a __device__ extended lambda with preserved return type
      std::result_of<decltype(lam3)()>::type xx3 = "abc";
      static_assert( std::is_same_v< std::result_of<decltype(lam3)()>::type, const char *>);
    
      auto lam4 = [] __device__ (char x) -> decltype(getRef(&x)) { return 0; };
      // lam4's return type is not preserved because it references the operator()'s
      // parameter types in the trailing return type.
      static_assert( ! __nv_is_extended_device_lambda_with_preserved_return_type(decltype(lam4)), "" );
    }
    
  15. For an extended device lambda:

      • Introspecting the parameter types of operator() is only supported in device code.

      • Introspecting the return type of operator() is supported only in device code, unless the trait function __nv_is_extended_device_lambda_with_preserved_return_type() returns true.

  16. If the functor object represented by an extended lambda is passed from host to device code (e.g., as the argument of a __global__ function), then any expression in the body of the lambda expression that captures variables must remain unchanged irrespective of whether the __CUDA_ARCH__ macro is defined, and whether the macro has a particular value. This restriction arises because the lambda’s closure class layout depends on the order in which captured variables are encountered when the compiler processes the lambda expression; the program may execute incorrectly if the closure class layout differs in device and host compilation.

    Example

    __device__ int result;
    
    template <typename T>
    __global__ void kernel(T in) { result = in(); }
    
    void foo(void) {
      int x1 = 1;
      auto lam1 = [=] __host__ __device__ {
        // Error: "x1" is only captured when __CUDA_ARCH__ is defined.
    #ifdef __CUDA_ARCH__
        return x1 + 1;
    #else
        return 10;
    #endif
      };
      kernel<<<1,1>>>(lam1);
    }
    
  17. As described previously, the CUDA compiler replaces an extended __device__ lambda expression with an instance of a placeholder type in the code sent to the host compiler. This placeholder type does not define a pointer-to-function conversion operator in host code; however, the conversion operator is provided in device code. Note that this restriction does not apply to __host__ __device__ extended lambdas.

    Example

    template <typename T>
    __global__ void kern(T in) {
      int (*fp)(double) = in;
    
      // OK: conversion in device code is supported
      fp(0);
      auto lam1 = [](double) { return 1; };
    
      // OK: conversion in device code is supported
      fp = lam1;
      fp(0);
    }
    
    void foo(void) {
      auto lam_d = [] __device__ (double) { return 1; };
      auto lam_hd = [] __host__ __device__ (double) { return 1; };
      kern<<<1,1>>>(lam_d);
      kern<<<1,1>>>(lam_hd);
    
      // OK : conversion for __host__ __device__ lambda is supported
      // in host code
      int (*fp)(double) = lam_hd;
    
      // Error: conversion for __device__ lambda is not supported in
      // host code.
      int (*fp2)(double) = lam_d;
    }
    
  18. As described previously, the CUDA compiler replaces an extended __device__ or __host__ __device__ lambda expression with an instance of a placeholder type in the code sent to the host compiler. This placeholder type may define C++ special member functions (e.g. constructor, destructor). As a result, some standard C++ type traits may return different results for the closure type of the extended lambda, in the CUDA frontend compiler versus the host compiler. The following type traits are affected: std::is_trivially_copyable, std::is_trivially_constructible, std::is_trivially_copy_constructible, std::is_trivially_move_constructible, std::is_trivially_destructible.

    Care must be taken that the results of these type traits are not used in __global__ function template instantiation or in __device__ / __constant__ / __managed__ variable template instantiation.

    Example

    template <bool b>
    void __global__ foo() { printf("hi"); }

    template <typename T>
    void dolaunch() {
      // ERROR: this kernel launch may fail, because the CUDA frontend compiler
      // and the host compiler may disagree on the result of the
      // std::is_trivially_copyable trait on the closure type of the
      // extended lambda
      foo<std::is_trivially_copyable<T>::value><<<1,1>>>();
      cudaDeviceSynchronize();
    }

    int main() {
      int x = 0;
      auto lam1 = [=] __host__ __device__ () { return x; };
      dolaunch<decltype(lam1)>();
    }
    

The CUDA compiler will generate compiler diagnostics for a subset of cases described in 1-12; no diagnostic will be generated for cases 13-17, but the host compiler may fail to compile the generated code.

14.7.3. Notes on __host__ __device__ lambdas

Unlike __device__ lambdas, __host__ __device__ lambdas can be called from host code. As described earlier, the CUDA compiler replaces an extended lambda expression defined in host code with an instance of a named placeholder type. The placeholder type for an extended __host__ __device__ lambda invokes the original lambda’s operator() with an indirect function call [30].

The presence of the indirect function call may cause an extended __host__ __device__ lambda to be less optimized by the host compiler than lambdas that are implicitly or explicitly __host__ only. In the latter case, the host compiler can easily inline the body of the lambda into the calling context. But in the case of an extended __host__ __device__ lambda, the host compiler encounters the indirect function call and may not be able to easily inline the original __host__ __device__ lambda body.

14.7.4. *this Capture By Value

When a lambda is defined within a non-static class member function, and the body of the lambda refers to a class member variable, C++11/C++14 rules require that the this pointer of the class is captured by value, instead of the referenced member variable. If the lambda is an extended __device__ or __host__ __device__ lambda defined in a host function, and the lambda is executed on the GPU, accessing the referenced member variable on the GPU will cause a run time error if the this pointer points to host memory.

Example:

#include <cstdio>

template <typename T>
__global__ void foo(T in) { printf("\n value = %d", in()); }

struct S1_t {
  int xxx;
  __host__ __device__ S1_t(void) : xxx(10) { };

  void doit(void) {

    auto lam1 = [=] __device__ {
       // reference to "xxx" causes
       // the 'this' pointer (S1_t*) to be captured by value
       return xxx + 1;

    };

    // Kernel launch fails at run time because 'this->xxx'
    // is not accessible from the GPU
    foo<<<1,1>>>(lam1);
    cudaDeviceSynchronize();
  }
};

int main(void) {
  S1_t s1;
  s1.doit();
}

C++17 solves this problem by adding a new “*this” capture mode. In this mode, the compiler makes a copy of the object denoted by “*this” instead of capturing the pointer this by value. The “*this” capture mode is described in more detail here: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0018r3.html.

The CUDA compiler supports the “*this” capture mode for lambdas defined within __device__ and __global__ functions and for extended __device__ lambdas defined in host code, when the --extended-lambda nvcc flag is used.

Here’s the above example modified to use “*this” capture mode:

#include <cstdio>

template <typename T>
__global__ void foo(T in) { printf("\n value = %d", in()); }

struct S1_t {
  int xxx;
  __host__ __device__ S1_t(void) : xxx(10) { };

  void doit(void) {

    // note the "*this" capture specification
    auto lam1 = [=, *this] __device__ {

       // reference to "xxx" causes
       // the object denoted by '*this' to be captured by
       // value, and the GPU code will access copy_of_star_this->xxx
       return xxx + 1;

    };

    // Kernel launch succeeds
    foo<<<1,1>>>(lam1);
    cudaDeviceSynchronize();
  }
};

int main(void) {
  S1_t s1;
  s1.doit();
}

“*this” capture mode is not allowed for unannotated lambdas defined in host code, or for extended __host__ __device__ lambdas. Examples of supported and unsupported usage:

struct S1_t {
  int xxx;
  __host__ __device__ S1_t(void) : xxx(10) { };

  void host_func(void) {

    // OK: use in an extended __device__ lambda
    auto lam1 = [=, *this] __device__ { return xxx; };

    // Error: use in an extended __host__ __device__ lambda
    auto lam2 = [=, *this] __host__ __device__ { return xxx; };

    // Error: use in an unannotated lambda in host function
    auto lam3 = [=, *this]  { return xxx; };
  }

  __device__ void device_func(void) {

    // OK: use in a lambda defined in a __device__ function
    auto lam1 = [=, *this] __device__ { return xxx; };

    // OK: use in a lambda defined in a __device__ function
    auto lam2 = [=, *this] __host__ __device__ { return xxx; };

    // OK: use in a lambda defined in a __device__ function
    auto lam3 = [=, *this]  { return xxx; };
  }

   __host__ __device__ void host_device_func(void) {

    // OK: use in an extended __device__ lambda
    auto lam1 = [=, *this] __device__ { return xxx; };

    // Error: use in an extended __host__ __device__ lambda
    auto lam2 = [=, *this] __host__ __device__ { return xxx; };

    // Error: use in an unannotated lambda in a __host__ __device__ function
    auto lam3 = [=, *this]  { return xxx; };
  }
};

14.7.5. Additional Notes

  1. ADL Lookup: As described earlier, the CUDA compiler will replace an extended lambda expression with an instance of a placeholder type, before invoking the host compiler. One template argument of the placeholder type uses the address of the function enclosing the original lambda expression. This may cause additional namespaces to participate in argument dependent lookup (ADL), for any host function call whose argument types involve the closure type of the extended lambda expression. This may cause an incorrect function to be selected by the host compiler.

    Example:

    namespace N1 {
      struct S1_t { };
      template <typename T>  void foo(T);
    };
    
    namespace N2 {
      template <typename T> int foo(T);
    
      template <typename T>  void doit(T in) {     foo(in);  }
    }
    
    void bar(N1::S1_t in) {
      /* extended __device__ lambda. In the code sent to the host compiler, this
         is replaced with the placeholder type instantiation expression
         ' __nv_dl_wrapper_t< __nv_dl_tag<void (*)(N1::S1_t in),(&bar),1> > { }'
    
         As a result, the namespace 'N1' participates in ADL lookup of the
         call to "foo" in the body of N2::doit, causing ambiguity.
      */
      auto lam1 = [=] __device__ { };
      N2::doit(lam1);
    }
    

    In the example above, the CUDA compiler replaced the extended lambda with a placeholder type that involves the N1 namespace. As a result, the namespace N1 participates in the ADL lookup for foo(in) in the body of N2::doit, and host compilation fails because multiple overload candidates N1::foo and N2::foo are found.
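    The mechanism can be reproduced in plain C++ without CUDA. In the sketch below, `Wrapper` is a hypothetical stand-in for the compiler-generated placeholder type; because its template argument names N1::S1_t, the namespace N1 becomes an associated namespace of the unqualified call inside N2::doit. (To keep the sketch compilable, it demonstrates ADL finding N1::foo rather than the ambiguity itself; adding a viable N2::foo would make the call ambiguous exactly as above.)

    ```cpp
    #include <cassert>
    #include <iostream>

    namespace N1 {
      struct S1_t { };
      // A candidate that becomes visible to unqualified calls only via ADL.
      template <typename T> int foo(T) { return 1; }
    }

    // Hypothetical stand-in for the compiler-generated placeholder type: its
    // template argument involves N1::S1_t, so N1 becomes an associated
    // namespace for argument dependent lookup.
    template <typename T> struct Wrapper { };

    namespace N2 {
      // Unqualified call: ADL searches the associated namespaces of T.
      template <typename T> int doit(T in) { return foo(in); }
    }

    int main() {
      Wrapper<N1::S1_t> w;
      // N1::foo is found even though it is never named or visible in N2::doit;
      // a second viable N2::foo would make this call ambiguous, as in the
      // CUDA example above.
      assert(N2::doit(w) == 1);
      std::cout << "N1::foo selected via ADL\n";
    }
    ```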

14.8. Code Samples

14.8.1. Data Aggregation Class

class PixelRGBA {
public:
    __device__ PixelRGBA(): r_(0), g_(0), b_(0), a_(0) { }

    __device__ PixelRGBA(unsigned char r, unsigned char g,
                         unsigned char b, unsigned char a = 255):
                         r_(r), g_(g), b_(b), a_(a) { }

private:
    unsigned char r_, g_, b_, a_;

    friend PixelRGBA operator+(const PixelRGBA&, const PixelRGBA&);
};

__device__
PixelRGBA operator+(const PixelRGBA& p1, const PixelRGBA& p2)
{
    return PixelRGBA(p1.r_ + p2.r_, p1.g_ + p2.g_,
                     p1.b_ + p2.b_, p1.a_ + p2.a_);
}

__device__ void func(void)
{
    PixelRGBA p1, p2;
    // ...      // Initialization of p1 and p2 here
    PixelRGBA p3 = p1 + p2;
}

14.8.2. Derived Class

__device__ void* operator new(size_t bytes, MemoryPool& p);
__device__ void operator delete(void*, MemoryPool& p);
class Shape {
public:
    __device__ Shape(void) { }
    __device__ void putThis(PrintBuffer *p) const;
    __device__ virtual void Draw(PrintBuffer *p) const {
         p->put("Shapeless");
    }
    __device__ virtual ~Shape() {}
};
class Point : public Shape {
public:
    __device__ Point() : x(0), y(0) {}
    __device__ Point(int ix, int iy) : x(ix), y(iy) { }
    __device__ void PutCoord(PrintBuffer *p) const;
    __device__ void Draw(PrintBuffer *p) const;
    __device__ ~Point() {}
private:
    int x, y;
};
__device__ Shape* GetPointObj(MemoryPool& pool)
{
    Shape* shape = new(pool) Point(rand(-20,10), rand(-100,-20));
    return shape;
}

14.8.3. Class Template

template <class T>
class myValues {
    T values[MAX_VALUES];
public:
    __device__ myValues(T clear) { ... }
    __device__ void setValue(int Idx, T value) { ... }
    __device__ void putToMemory(T* valueLocation) { ... }
};

template <class T>
void __global__ useValues(T* memoryBuffer) {
    myValues<T> myLocation(0);
    ...
}

__device__ void* buffer;

int main()
{
    ...
    useValues<int><<<blocks, threads>>>(buffer);
    ...
}

14.8.4. Function Template

template <typename T>
__device__ bool func(T x)
{
   ...
   return (...);
}

template <>
__device__ bool func<int>(int x) // Specialization
{
   return true;
}

// Explicit argument specification
bool result = func<double>(0.5);

// Implicit argument deduction
int x = 1;
bool result = func(x);

14.8.5. Functor Class

class Add {
public:
    __device__  float operator() (float a, float b) const
    {
        return a + b;
    }
};

class Sub {
public:
    __device__  float operator() (float a, float b) const
    {
        return a - b;
    }
};

// Device code
template<class O> __global__
void VectorOperation(const float * A, const float * B, float * C,
                     unsigned int N, O op)
{
    unsigned int iElement = blockDim.x * blockIdx.x + threadIdx.x;
    if (iElement < N)
        C[iElement] = op(A[iElement], B[iElement]);
}

// Host code
int main()
{
    ...
    VectorOperation<<<blocks, threads>>>(v1, v2, v3, N, Add());
    ...
}
15. e.g., the <<<...>>> syntax for launching kernels.

16. This does not apply to entities that may be defined in more than one translation unit, such as compiler-generated template instantiations.

17. The intent is to allow variable memory space specifiers for static variables in a __host__ __device__ function during device compilation, but disallow it during host compilation.

18. One way to debug a suspected layout mismatch of a type C is to use printf to output the values of sizeof(C) and offsetof(C, field) in host and device code.

19. Note that this may negatively impact compile time due to the presence of extra declarations.

20. At present, the -std=c++11 flag is supported only for the following host compilers: gcc version >= 4.7, clang, icc >= 15, and xlc >= 13.1.

21. Including operator().

22. The restrictions are the same as with a non-constexpr callee function.

23. Note that the behavior of experimental flags may change in future compiler releases.

24. C++ Standard Section [basic.types]

25. C++ Standard Section [expr.const]

26. At present, the -std=c++14 flag is supported only for the following host compilers: gcc version >= 5.1, clang version >= 3.7, and icc version >= 17.

27. At present, the -std=c++17 flag is supported only for the following host compilers: gcc version >= 7.0, clang version >= 8.0, Visual Studio version >= 2017, pgi compiler version >= 19.0, and icc compiler version >= 19.0.

28. At present, the -std=c++20 flag is supported only for the following host compilers: gcc version >= 10.0, clang version >= 10.0, Visual Studio version >= 2022, and nvc++ version >= 20.7.

29. When using the icc host compiler, this flag is only supported for icc >= 1800.

30. The traits will always return false if extended lambda mode is not active.

31. In contrast, the C++ standard specifies that the captured variable is used to direct-initialize the field of the closure type.

15. Texture Fetching

This section gives the formula used to compute the value returned by the texture functions of Texture Functions depending on the various attributes of the texture object (see Texture and Surface Memory).

The texture bound to the texture object is represented as an array T of

  • N texels for a one-dimensional texture,

  • N x M texels for a two-dimensional texture,

  • N x M x L texels for a three-dimensional texture.

It is fetched using non-normalized texture coordinates x, y, and z, or the normalized texture coordinates x/N, y/M, and z/L as described in Texture Memory. In this section, the coordinates are assumed to be in the valid range. Texture Memory explained how out-of-range coordinates are remapped to the valid range based on the addressing mode.

15.1. Nearest-Point Sampling

In this filtering mode, the value returned by the texture fetch is

  • tex(x) = T[i] for a one-dimensional texture,

  • tex(x,y) = T[i,j] for a two-dimensional texture,

  • tex(x,y,z) = T[i,j,k] for a three-dimensional texture,

where i = floor(x), j = floor(y), and k = floor(z).

Figure 32 illustrates nearest-point sampling for a one-dimensional texture with N=4.

Figure 32 Nearest-Point Sampling Filtering Mode

For integer textures, the value returned by the texture fetch can be optionally remapped to [0.0, 1.0] (see Texture Memory).

15.2. Linear Filtering

In this filtering mode, which is only available for floating-point textures, the value returned by the texture fetch is

  • tex(x) = (1−α)T[i] + αT[i+1] for a one-dimensional texture,

  • tex(x,y) = (1−α)(1−β)T[i,j] + α(1−β)T[i+1,j] + (1−α)βT[i,j+1] + αβT[i+1,j+1] for a two-dimensional texture,

  • tex(x,y,z) =

    (1−α)(1−β)(1−γ)T[i,j,k] + α(1−β)(1−γ)T[i+1,j,k] +

    (1−α)β(1−γ)T[i,j+1,k] + αβ(1−γ)T[i+1,j+1,k] +

    (1−α)(1−β)γT[i,j,k+1] + α(1−β)γT[i+1,j,k+1] +

    (1−α)βγT[i,j+1,k+1] + αβγT[i+1,j+1,k+1]

    for a three-dimensional texture,

where:

  • i = floor(x_B), α = frac(x_B), x_B = x − 0.5,

  • j = floor(y_B), β = frac(y_B), y_B = y − 0.5,

  • k = floor(z_B), γ = frac(z_B), z_B = z − 0.5.

α, β, and γ are stored in 9-bit fixed point format with 8 bits of fractional value (so 1.0 is exactly represented).

Figure 33 illustrates linear filtering of a one-dimensional texture with N=4.

Figure 33 Linear Filtering Mode

15.3. Table Lookup

A table lookup TL(x) where x spans the interval [0,R] can be implemented as TL(x) = tex((N−1)/R × x + 0.5) in order to ensure that TL(0) = T[0] and TL(R) = T[N−1].

Figure 34 illustrates the use of texture filtering to implement a table lookup with R=4 or R=1 from a one-dimensional texture with N=4.

Figure 34 One-Dimensional Table Lookup Using Linear Filtering

16. Compute Capabilities

The general specifications and features of a compute device depend on its compute capability (see Compute Capability).

Table 20 and Table 21 show the features and technical specifications associated with each compute capability that is currently supported.

Floating-Point Standard reviews the compliance with the IEEE floating-point standard.

Sections Compute Capability 5.x, Compute Capability 6.x, Compute Capability 7.x, Compute Capability 8.x and Compute Capability 9.0 give more details on the architecture of devices of compute capabilities 5.x, 6.x, 7.x, 8.x and 9.0 respectively.

16.1. Feature Availability

A compute feature is introduced with a compute architecture with the intention that the feature will be available on all subsequent architectures. This is shown in Table 20 by the “yes” for availability of a feature on compute capabilities subsequent to its introduction.

Highly specialized compute features that are introduced with an architecture may not be guaranteed to be available on all subsequent compute capabilities. These features target acceleration of specialized operations which are not intended for all classes of compute capabilities (denoted by the compute capability’s minor number) or are likely to significantly change on future generations (denoted by the compute capability’s major number).

There are potentially two sets of compute features for a given compute capability:

Compute Capability #.#: The predominant set of compute features that are introduced with the intent to be available for subsequent compute architectures. These features and their availability are summarized in Table 20.

Compute Capability #.#a: A small and highly specialized set of features that are introduced to accelerate specialized operations, which are not guaranteed to be available or might change significantly on subsequent compute architectures. These features are summarized in the respective “Compute Capability #.#” subsection.

Compilation of device code targets a particular compute capability. A feature which appears in device code must be available for the targeted compute capability. For example:

  • The compute_90 compilation target allows use of Compute Capability 9.0 features but does not allow use of Compute Capability 9.0a features.

  • The compute_90a compilation target allows use of the complete set of compute device features, both 9.0a features and 9.0 features.
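As a sketch, the distinction shows up in the nvcc architecture flag (assuming an nvcc from CUDA 12.x on the PATH; kernel.cu is a hypothetical source file):

```shell
# Targets Compute Capability 9.0: 9.0a-specific features are rejected.
nvcc -arch=sm_90 -c kernel.cu -o kernel_sm90.o

# Targets Compute Capability 9.0a: both 9.0 and 9.0a features are allowed,
# but availability of the 9.0a features on later architectures is not guaranteed.
nvcc -arch=sm_90a -c kernel.cu -o kernel_sm90a.o
```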

16.2. Features and Technical Specifications

Table 20 Feature Support per Compute Capability

(Unlisted features are supported for all compute capabilities. The compute capability columns of the original table are 5.0/5.2, 5.3, 6.x, 7.x, 8.x, and 9.0; each feature below lists the capabilities on which it is supported.)

  • Atomic functions operating on 32-bit integer values in global memory (Atomic Functions): all compute capabilities

  • Atomic functions operating on 32-bit integer values in shared memory (Atomic Functions): all compute capabilities

  • Atomic functions operating on 64-bit integer values in global memory (Atomic Functions): all compute capabilities

  • Atomic functions operating on 64-bit integer values in shared memory (Atomic Functions): all compute capabilities

  • Atomic functions operating on 128-bit integer values in global memory (Atomic Functions): 9.0 and later

  • Atomic functions operating on 128-bit integer values in shared memory (Atomic Functions): 9.0 and later

  • Atomic addition operating on 32-bit floating point values in global and shared memory (atomicAdd()): all compute capabilities

  • Atomic addition operating on 64-bit floating point values in global memory and shared memory (atomicAdd()): 6.x and later

  • Atomic addition operating on float2 and float4 floating point vectors in global memory (atomicAdd()): 9.0 and later

  • Warp vote functions (Warp Vote Functions): all compute capabilities

  • Memory fence functions (Memory Fence Functions): all compute capabilities

  • Synchronization functions (Synchronization Functions): all compute capabilities

  • Surface functions (Surface Functions): all compute capabilities

  • Unified Memory Programming (Unified Memory Programming): all compute capabilities

  • Dynamic Parallelism (CUDA Dynamic Parallelism): all compute capabilities

  • Half-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion: 5.3 and later

  • Bfloat16-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion: 8.x and later

  • Tensor Cores: 7.x and later

  • Mixed Precision Warp-Matrix Functions (Warp matrix functions): 7.x and later

  • Hardware-accelerated memcpy_async (Asynchronous Data Copies using cuda::pipeline): 8.x and later

  • Hardware-accelerated Split Arrive/Wait Barrier (Asynchronous Barrier): 8.x and later

  • L2 Cache Residency Management (Device Memory L2 Access Management): 8.x and later

  • DPX Instructions for Accelerated Dynamic Programming: 9.0 and later

  • Distributed Shared Memory: 9.0 and later

  • Thread Block Cluster: 9.0 and later

  • Tensor Memory Accelerator (TMA) unit: 9.0 and later

Note that the KB and K units used in the following table correspond to 1024 bytes (i.e., a KiB) and 1024 respectively.
请注意,下表中使用的 KB 和 K 单位分别对应于 1024 字节(即 KiB)和 1024。

Table 21 Technical Specifications per Compute Capability
表 21 每个计算能力的技术规格

Compute Capability 计算能力: 5.0, 5.2, 5.3, 6.0, 6.1, 6.2, 7.0, 7.2, 7.5, 8.0, 8.6, 8.7, 8.9, 9.0

Technical Specifications 技术规格

Maximum number of resident grids per device (Concurrent Kernel Execution)
每个设备的最大常驻网格数(并发内核执行)

32 (5.0, 5.2); 16 (5.3); 128 (6.0); 32 (6.1); 16 (6.2); 128 (7.0); 16 (7.2); 128 (7.5 – 9.0)

Maximum dimensionality of grid of thread blocks
线程块网格的最大维数

3

Maximum x -dimension of a grid of thread blocks [thread blocks]
线程块网格的最大 x 维度 [线程块]

2^31 - 1

Maximum y- or z-dimension of a grid of thread blocks
线程块网格的 y 或 z 维度的最大值

65535

Maximum dimensionality of thread block
线程块的最大维度

3

Maximum x- or y-dimensionality of a block
块的最大 x 或 y 维度

1024

Maximum z-dimension of a block
块的最大 z 维度

64

Maximum number of threads per block
每个块的最大线程数

1024

Warp size 线程束大小

32

Maximum number of resident blocks per SM
每个 SM 的最大驻留块数

32 (5.0 – 7.2); 16 (7.5); 32 (8.0); 16 (8.6, 8.7); 24 (8.9); 32 (9.0)

Maximum number of resident warps per SM
每个 SM 的最大常驻 warp 数量

64 (5.0 – 7.2); 32 (7.5); 64 (8.0); 48 (8.6 – 8.9); 64 (9.0)

Maximum number of resident threads per SM
每个 SM 的最大常驻线程数

2048 (5.0 – 7.2); 1024 (7.5); 2048 (8.0); 1536 (8.6 – 8.9); 2048 (9.0)

Number of 32-bit registers per SM
每个 SM 的 32 位寄存器数量

64 K

Maximum number of 32-bit registers per thread block
每个线程块的 32 位寄存器的最大数量

64 K (5.0, 5.2); 32 K (5.3); 64 K (6.0, 6.1); 32 K (6.2); 64 K (7.0 – 9.0)

Maximum number of 32-bit registers per thread
每个线程的 32 位寄存器的最大数量

255

Maximum amount of shared memory per SM
每个 SM 的最大共享内存量

64 KB (5.0); 96 KB (5.2); 64 KB (5.3, 6.0); 96 KB (6.1); 64 KB (6.2); 96 KB (7.0, 7.2); 64 KB (7.5); 164 KB (8.0); 100 KB (8.6); 164 KB (8.7); 100 KB (8.9); 228 KB (9.0)

Maximum amount of shared memory per thread block
每个线程块的最大共享内存量

48 KB (5.0 – 6.2); 96 KB (7.0, 7.2); 64 KB (7.5); 163 KB (8.0); 99 KB (8.6); 163 KB (8.7); 99 KB (8.9); 227 KB (9.0)

Number of shared memory banks
共享内存存储体(bank)数量

32

Maximum amount of local memory per thread
每个线程的本地内存最大量

512 KB

Constant memory size 常量内存大小

64 KB

Cache working set per SM for constant memory
为每个 SM 缓存常量内存的工作集

8 KB

4 KB

8 KB

Cache working set per SM for texture memory
为纹理内存缓存每个 SM 的工作集

Between 12 KB and 48 KB
介于 12 KB 和 48 KB 之间

Between 24 KB and 48 KB
介于 24 KB 和 48 KB 之间

32 ~ 128 KB

32 or 64 KB 32 或 64 KB

28 KB ~ 192 KB

28 KB ~ 128 KB

28 KB ~ 192 KB

28 KB ~ 128 KB

28 KB ~ 256 KB

Maximum width for a 1D texture object using a CUDA array
使用 CUDA 数组的 1D 纹理对象的最大宽度

65536

131072

Maximum width for a 1D texture object using linear memory
使用线性内存的一维纹理对象的最大宽度

2^27

2^28

2^27

2^28

2^27

2^28

Maximum width and number of layers for a 1D layered texture object
1D 分层纹理对象的最大宽度和层数

16384 x 2048

32768 x 2048

Maximum width and height for a 2D texture object using a CUDA array
使用 CUDA 数组的 2D 纹理对象的最大宽度和高度

65536 x 65536

131072 x 65536

Maximum width and height for a 2D texture object using linear memory
使用线性内存的 2D 纹理对象的最大宽度和高度

65536 x 65536

131072 x 65000

Maximum width and height for a 2D texture object using a CUDA array supporting texture gather
使用支持纹理聚合的 CUDA 数组的 2D 纹理对象的最大宽度和高度

16384 x 16384

32768 x 32768

Maximum width, height, and number of layers for a 2D layered texture object
2D 分层纹理对象的最大宽度、高度和层数

16384 x 16384 x 2048

32768 x 32768 x 2048

Maximum width, height, and depth for a 3D texture object using a CUDA array
使用 CUDA 数组的 3D 纹理对象的最大宽度、高度和深度

4096 x 4096 x 4096

16384 x 16384 x 16384

Maximum width (and height) for a cubemap texture object
立方体贴图纹理对象的最大宽度(和高度)

16384

32768

Maximum width (and height) and number of layers for a cubemap layered texture object
立方体贴图分层纹理对象的最大宽度(和高度)和层数

16384 x 2046

32768 x 2046

Maximum number of textures that can be bound to a kernel
可以绑定到内核的纹理的最大数量

256

Maximum width for a 1D surface object using a CUDA array
使用 CUDA 数组的 1D 表面对象的最大宽度

16384

32768

Maximum width and number of layers for a 1D layered surface object
一维分层表面对象的最大宽度和层数

16384 x 2048

32768 x 2048

Maximum width and height for a 2D surface object using a CUDA array
使用 CUDA 数组的 2D 表面对象的最大宽度和高度

65536 x 65536

131072 x 65536

Maximum width, height, and number of layers for a 2D layered surface object
2D 分层表面对象的最大宽度、高度和层数

16384 x 16384 x 2048

32768 x 32768 x 2048

Maximum width, height, and depth for a 3D surface object using a CUDA array
使用 CUDA 数组的 3D 表面对象的最大宽度、高度和深度

4096 x 4096 x 4096

16384 x 16384 x 16384

Maximum width (and height) for a cubemap surface object using a CUDA array
使用 CUDA 数组的立方体贴图表面对象的最大宽度(和高度)

16384

32768

Maximum width (and height) and number of layers for a cubemap layered surface object
立方贴图分层表面对象的最大宽度(和高度)和层数

16384 x 2046

32768 x 2046

Maximum number of surfaces that can be bound to a kernel
可以绑定到内核的最大表面数量

16

32

16.3. Floating-Point Standard
16.3. 浮点标准 

All compute devices follow the IEEE 754-2008 standard for binary floating-point arithmetic with the following deviations:
所有计算设备遵循 IEEE 754-2008 标准进行二进制浮点运算,但存在以下偏差:

  • There is no dynamically configurable rounding mode; however, most of the operations support multiple IEEE rounding modes, exposed via device intrinsics.
    没有动态可配置的舍入模式;但是,大多数操作支持多个 IEEE 舍入模式,通过设备内部函数公开。

  • There is no mechanism for detecting that a floating-point exception has occurred and all operations behave as if the IEEE-754 exceptions are always masked, and deliver the masked response as defined by IEEE-754 if there is an exceptional event. For the same reason, while SNaN encodings are supported, they are not signaling and are handled as quiet.
    没有机制可以检测浮点异常是否发生,所有操作都会表现得好像 IEEE-754 异常总是被屏蔽,并且在发生异常事件时提供 IEEE-754 定义的屏蔽响应。出于同样的原因,虽然支持 SNaN 编码,但它们不是信号传递的,而是作为静默处理。

  • The result of a single-precision floating-point operation involving one or more input NaNs is the quiet NaN of bit pattern 0x7fffffff.
    单精度浮点运算涉及一个或多个输入 NaN 时的结果是比特模式为 0x7fffffff 的安静 NaN。

  • Double-precision floating-point absolute value and negation are not compliant with IEEE-754 with respect to NaNs; these are passed through unchanged.
双精度浮点绝对值和取反在 NaN 处理方面不符合 IEEE-754;NaN 会原样传递。

Code must be compiled with -ftz=false, -prec-div=true, and -prec-sqrt=true to ensure IEEE compliance (this is the default setting; see the nvcc user manual for description of these compilation flags).
代码必须使用 -ftz=false-prec-div=true-prec-sqrt=true 编译,以确保符合 IEEE 标准(这是默认设置;请参阅 nvcc 用户手册,了解这些编译标志的描述)。

Regardless of the setting of the compiler flag -ftz,
无论编译器标志 -ftz 的设置如何,

  • atomic single-precision floating-point adds on global memory always operate in flush-to-zero mode, i.e., behave equivalent to FADD.F32.FTZ.RN,
    原子单精度浮点加法在全局内存上始终以清零模式运行,即等效于 FADD.F32.FTZ.RN

  • atomic single-precision floating-point adds on shared memory always operate with denormal support, i.e., behave equivalent to FADD.F32.RN.
    原子单精度浮点加法在共享内存上始终支持非规格化数,即等效于 FADD.F32.RN

In accordance with the IEEE-754R standard, if one of the input parameters to fminf(), fmin(), fmaxf(), or fmax() is NaN, but not the other, the result is the non-NaN parameter.
根据 IEEE-754R 标准,如果 fminf()fmin()fmaxf()fmax() 中的一个输入参数是 NaN,但另一个不是,则结果为非 NaN 参数。

The conversion of a floating-point value to an integer value in the case where the floating-point value falls outside the range of the integer format is left undefined by IEEE-754. For compute devices, the behavior is to clamp to the end of the supported range. This is unlike the x86 architecture behavior.
在 IEEE-754 中,当浮点值超出整数格式范围时,浮点值转换为整数值的行为是未定义的。对于计算设备,行为是将其夹紧到支持范围的末端。这与 x86 架构的行为不同。

The behavior of integer division by zero and integer overflow is left undefined by IEEE-754. For compute devices, there is no mechanism for detecting that such integer operation exceptions have occurred. Integer division by zero yields an unspecified, machine-specific value.
整数除零和整数溢出的行为由 IEEE-754 定义为未定义。对于计算设备,没有机制可以检测到这些整数操作异常已发生。整数除零会产生一个未指定的、机器特定的值。

https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus includes more information on the floating point accuracy and compliance of NVIDIA GPUs.
https://developer.nvidia.com/content/precision-performance-floating-point-and-ieee-754-compliance-nvidia-gpus 包含了有关 NVIDIA GPU 浮点精度和符合性的更多信息。

16.4. Compute Capability 5.x
16.4. 计算能力 5.x 

16.4.1. Architecture 16.4.1. 架构 

An SM consists of:
一个 SM 由以下组成:

  • 128 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
    128 个 CUDA 核心用于算术运算(请参阅算术指令以获取算术运算的吞吐量),

  • 32 special function units for single-precision floating-point transcendental functions,
    32 个用于单精度浮点超越函数的特殊功能单元,

  • 4 warp schedulers. 4 个 warp 调度器。

When an SM is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
当 SM 被分配 warp 来执行时,它首先将这些 warp 分配给四个调度器。然后,在每个指令发射时刻,每个调度器为其所分配的、已准备好执行的 warp 之一发射一条指令(如果有的话)。

An SM has: 一个 SM 有:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
    一个只读的常量缓存,被所有功能单元共享,加快从设备内存中的常量内存空间读取的速度

  • a unified L1/texture cache of 24 KB used to cache reads from global memory,
    一个用于缓存来自全局内存读取的 24 KB 统一的 L1/纹理缓存

  • 64 KB of shared memory for devices of compute capability 5.0 or 96 KB of shared memory for devices of compute capability 5.2.
    设备的计算能力为 5.0 的共享内存为 64 KB,计算能力为 5.2 的设备的共享内存为 96 KB。

The unified L1/texture cache is also used by the texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory.
统一的 L1/纹理缓存也被纹理单元使用,该单元实现了纹理和表面内存中提到的各种寻址模式和数据过滤。

There is also an L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills. Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration).
所有 SM 共享的 L2 缓存也用于缓存对本地或全局内存的访问,包括临时寄存器溢出。应用程序可以通过检查 l2CacheSize 设备属性(请参阅设备枚举)来查询 L2 缓存大小。
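A minimal host-side query might look as follows (standard CUDA runtime API; querying device 0 is an assumption for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    std::printf("L2 cache size: %d bytes\n", prop.l2CacheSize);
    return 0;
}
```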

The cache behavior (e.g., whether reads are cached in both the unified L1/texture cache and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load instruction.
缓存行为(例如,读取是否在统一的 L1/纹理缓存和 L2 中缓存,或仅在 L2 中缓存)可以通过对加载指令使用修饰符,在每次访问时部分配置。

16.4.2. Global Memory
16.4.2. 全局内存 

Global memory accesses are always cached in L2.
全局内存访问总是在 L2 中缓存。

Data that is read-only for the entire lifetime of the kernel can also be cached in the unified L1/texture cache described in the previous section by reading it using the __ldg() function (see Read-Only Data Cache Load Function). When the compiler detects that the read-only condition is satisfied for some data, it will use __ldg() to read it. The compiler might not always be able to detect that the read-only condition is satisfied for some data. Marking pointers used for loading such data with both the const and __restrict__ qualifiers increases the likelihood that the compiler will detect the read-only condition.
内核整个生命周期中只读的数据也可以通过使用 __ldg() 函数(请参阅只读数据缓存加载函数)读取,并缓存在前一节中描述的统一 L1/纹理缓存中。当编译器检测到某些数据满足只读条件时,它将使用 __ldg() 来读取。编译器可能并非总是能够检测到某些数据满足只读条件。使用 const__restrict__ 限定符标记用于加载此类数据的指针,可以增加编译器检测只读条件的可能性。
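A minimal sketch of such a kernel is shown below; the kernel name and parameters are hypothetical, and the explicit __ldg() call is optional when the qualifiers let the compiler prove the read-only condition on its own:

```cuda
// 'in' is marked const __restrict__ so the compiler can prove it is
// read-only for the lifetime of the kernel; __ldg() additionally forces
// the load through the unified L1/texture (read-only data) cache.
__global__ void scale(float* __restrict__ out,
                      const float* __restrict__ in,
                      float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(&in[i]) * factor;
}
```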

Data that is not read-only for the entire lifetime of the kernel cannot be cached in the unified L1/texture cache for devices of compute capability 5.0. For devices of compute capability 5.2, it is, by default, not cached in the unified L1/texture cache, but caching may be enabled using the following mechanisms:
在内核整个生命周期内并非只读的数据,对于计算能力为 5.0 的设备,无法缓存在统一 L1/纹理缓存中。对于计算能力为 5.2 的设备,默认情况下不会缓存在统一 L1/纹理缓存中,但可以使用以下机制启用缓存:

  • Perform the read using inline assembly with the appropriate modifier as described in the PTX reference manual;
    使用内联汇编执行读取操作,使用 PTX 参考手册中描述的适当修饰符

  • Compile with the -Xptxas -dlcm=ca compilation flag, in which case all reads are cached, except reads that are performed using inline assembly with a modifier that disables caching;
    使用 -Xptxas -dlcm=ca 编译标志进行编译,此时所有读取都会被缓存,除非使用禁用缓存的修饰符执行内联汇编的读取;

  • Compile with the -Xptxas -fscm=ca compilation flag, in which case all reads are cached, including reads that are performed using inline assembly regardless of the modifier used.
    使用 -Xptxas -fscm=ca 编译标志进行编译,这样所有读取都会被缓存,包括使用内联汇编执行的读取,无论使用的修饰符是什么。

When caching is enabled using one of the three mechanisms listed above, devices of compute capability 5.2 will cache global memory reads in the unified L1/texture cache for all kernel launches except for the kernel launches for which thread blocks consume too much of the SM’s register file. These exceptions are reported by the profiler.
当使用上述三种机制之一启用缓存时,计算能力为 5.2 的设备将在统一的 L1/纹理缓存中缓存全局内存读取,除了那些线程块消耗了太多 SM 寄存器文件的内核启动。这些异常情况由性能分析器报告。

16.4.3. Shared Memory
16.4.3. 共享内存 

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Each bank has a bandwidth of 32 bits per clock cycle.
共享内存有 32 个存储体(bank),其组织方式使连续的 32 位字映射到连续的存储体。每个存储体每个时钟周期的带宽为 32 位。

A shared memory request for a warp does not generate a bank conflict between two threads that access any address within the same 32-bit word (even though the two addresses fall in the same bank). In that case, for read accesses, the word is broadcast to the requesting threads and for write accesses, each address is written by only one of the threads (which thread performs the write is undefined).
一个 warp 的共享内存请求不会在访问同一 32 位字内任意地址的两个线程之间产生存储体冲突(即使这两个地址位于同一存储体)。在这种情况下,对于读访问,该字会广播给发出请求的线程;对于写访问,每个地址只由其中一个线程写入(由哪个线程执行写入是未定义的)。

Figure 35 shows some examples of strided access.
图 35 显示了一些跨步访问的示例。

Figure 36 shows some examples of memory read accesses that involve the broadcast mechanism.
图 36 显示了涉及广播机制的一些内存读取访问示例。

Figure 35 Strided Shared Memory Accesses in 32 bit bank size mode.
图 35 32 位存储体大小模式下的跨步共享内存访问。 

Left 

Linear addressing with a stride of one 32-bit word (no bank conflict).
跨距为一个 32 位字的线性寻址(无存储体冲突)。

Middle 中间

Linear addressing with a stride of two 32-bit words (two-way bank conflict).
跨距为两个 32 位字的线性寻址(双路存储体冲突)。

Right 右

Linear addressing with a stride of three 32-bit words (no bank conflict).
跨距为三个 32 位字的线性寻址(无存储体冲突)。

Figure 36 Irregular Shared Memory Accesses.
图 36 不规则共享内存访问。 

Left 

Conflict-free access via random permutation.
通过随机排列实现无冲突访问。

Middle 中间

Conflict-free access since threads 3, 4, 6, 7, and 9 access the same word within bank 5.
由于线程 3、4、6、7 和 9 访问的是存储体 5 中的同一个字,因此访问无冲突。

Right 右

Conflict-free broadcast access (threads access the same word within a bank).
无冲突的广播访问(线程访问同一存储体内的同一个字)。

16.5. Compute Capability 6.x
16.5. 计算能力 6.x 

16.5.1. Architecture 16.5.1. 架构 

An SM consists of:
一个 SM 由以下组成:

  • 64 (compute capability 6.0) or 128 (6.1 and 6.2) CUDA cores for arithmetic operations,
    64(计算能力 6.0)或 128(6.1 和 6.2)个 CUDA 核心用于算术运算。

  • 16 (6.0) or 32 (6.1 and 6.2) special function units for single-precision floating-point transcendental functions,
    16(6.0)个或 32(6.1 和 6.2)个用于单精度浮点超越函数的特殊功能单元。

  • 2 (6.0) or 4 (6.1 and 6.2) warp schedulers.
    2(6.0)或 4(6.1 和 6.2)个 warp 调度器。

When an SM is given warps to execute, it first distributes them among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
当给定 SM 要执行的 warp 时,首先将它们分配给其调度器。然后,在每个指令发出时间,每个调度器为其已准备好执行的分配的 warp 之一发出一条指令,如果有的话。

An SM has: 一个 SM 有:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
    一个只读的常量缓存,被所有功能单元共享,加快从设备内存中的常量内存空间读取的速度

  • a unified L1/texture cache for reads from global memory of size 24 KB (6.0 and 6.2) or 48 KB (6.1),
    一个统一的大小为 24 KB(6.0 和 6.2)或 48 KB(6.1)的用于从全局内存读取的 L1/纹理缓存。

  • a shared memory of size 64 KB (6.0 and 6.2) or 96 KB (6.1).
    一个大小为 64 KB(6.0 和 6.2)或 96 KB(6.1)的共享内存。

The unified L1/texture cache is also used by the texture unit that implements the various addressing modes and data filtering mentioned in Texture and Surface Memory.
统一的 L1/纹理缓存也被纹理单元使用,该单元实现了纹理和表面内存中提到的各种寻址模式和数据过滤。

There is also an L2 cache shared by all SMs that is used to cache accesses to local or global memory, including temporary register spills. Applications may query the L2 cache size by checking the l2CacheSize device property (see Device Enumeration).
所有 SM 共享的 L2 缓存也用于缓存对本地或全局内存的访问,包括临时寄存器溢出。应用程序可以通过检查 l2CacheSize 设备属性(请参阅设备枚举)来查询 L2 缓存大小。

The cache behavior (e.g., whether reads are cached in both the unified L1/texture cache and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load instruction.
缓存行为(例如,读取是否在统一的 L1/纹理缓存和 L2 中缓存,或仅在 L2 中缓存)可以通过对加载指令使用修饰符,在每次访问时部分配置。

16.5.2. Global Memory
16.5.2. 全局内存 

Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory).
全局内存的行为方式与计算能力为 5.x 的设备相同(请参阅全局内存)。

16.5.3. Shared Memory
16.5.3. 共享内存 

Shared memory behaves the same way as in devices of compute capability 5.x (See Shared Memory).
共享内存的行为与计算能力为 5.x 的设备中的行为相同(请参阅共享内存)。

16.6. Compute Capability 7.x
16.6. 计算能力 7.x 

16.6.1. Architecture 16.6.1. 架构 

An SM consists of:
一个 SM 由以下组成:

  • 64 FP32 cores for single-precision arithmetic operations,
    64 个 FP32 核心用于单精度算术运算

  • 32 FP64 cores for double-precision arithmetic operations,
    32 个 FP64 核心用于双精度算术运算,

  • 64 INT32 cores for integer math,
    64 个 INT32 核心用于整数运算

  • 8 mixed-precision Tensor Cores for deep learning matrix arithmetic
    8 个混合精度张量核心,用于深度学习矩阵运算

  • 16 special function units for single-precision floating-point transcendental functions,
    16 个用于单精度浮点超越函数的特殊功能单元,

  • 4 warp schedulers. 4 个 warp 调度器。

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.
SM 在其调度器之间静态分配其 warp。然后,在每个指令发出时间,每个调度器为其已准备执行的分配 warp 之一发出一条指令,如果有的话。

An SM has: 一个 SM 有:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,
    一个只读的常量缓存,被所有功能单元共享,加快从设备内存中的常量内存空间读取的速度

  • a unified data cache and shared memory with a total size of 128 KB (Volta) or 96 KB (Turing).
    具有总大小为 128 KB(Volta)或 96 KB(Turing)的统一数据缓存和共享内存。

Shared memory is partitioned out of unified data cache, and can be configured to various sizes (See Shared Memory.) The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.
共享内存被划分出统一数据缓存,并可以配置为各种大小(请参阅共享内存)。其余的数据缓存用作 L1 缓存,也被纹理单元使用,该单元实现了纹理和表面内存中提到的各种寻址和数据过滤模式。

16.6.2. Independent Thread Scheduling
16.6.2. 独立线程调度 

The Volta architecture introduces Independent Thread Scheduling among threads in a warp, enabling intra-warp synchronization patterns previously unavailable and simplifying code changes when porting CPU code. However, this can lead to a rather different set of threads participating in the executed code than intended if the developer made assumptions about warp-synchronicity of previous hardware architectures.
Volta 架构引入了独立线程调度,使得 warp 内的线程之间可以进行独立调度,从而实现了以前不可用的 warp 内同步模式,并在将 CPU 代码移植时简化了代码更改。然而,如果开发人员对以前的硬件架构的 warp 同步性做出了假设,这可能导致参与执行代码的线程集合与预期的有所不同。

Below are code patterns of concern and suggested corrective actions for Volta-safe code.
以下是关注的代码模式和 Volta 安全代码的建议纠正操作。

  1. For applications using warp intrinsics (__shfl*, __any, __all, __ballot), it is necessary that developers port their code to the new, safe, synchronizing counterpart, with the *_sync suffix. The new warp intrinsics take in a mask of threads that explicitly define which lanes (threads of a warp) must participate in the warp intrinsic. See Warp Vote Functions and Warp Shuffle Functions for details.
    对于使用 warp 内在函数( __shfl*__any__all__ballot )的应用程序,开发人员需要将其代码移植到新的、安全的、同步的对应函数,使用 *_sync 后缀是必要的。新的 warp 内在函数接受一个线程掩码,明确定义了哪些通道(warp 的线程)必须参与到 warp 内在函数中。有关详细信息,请参阅 Warp 投票函数和 Warp 洗牌函数。

Since the intrinsics are available with CUDA 9.0+, (if necessary) code can be executed conditionally with the following preprocessor macro:
由于内在函数可在 CUDA 9.0+中使用,(如果需要)可以使用以下预处理宏有条件地执行代码:

#if defined(CUDART_VERSION) && CUDART_VERSION >= 9000
// *_sync intrinsic
#endif

These intrinsics are available on all architectures, not just Volta or Turing, and in most cases a single code-base will suffice for all architectures. Note, however, that for Pascal and earlier architectures, all threads in mask must execute the same warp intrinsic instruction in convergence, and the union of all values in mask must be equal to the warp’s active mask. The following code pattern is valid on Volta, but not on Pascal or earlier architectures.
这些内在函数在所有架构上都可用,不仅仅是 Volta 或 Turing,并且在大多数情况下,单个代码库就足够适用于所有架构。但是,请注意,对于 Pascal 和更早的架构,掩码中的所有线程必须在收敛中执行相同的 warp 内在指令,并且掩码中所有值的并集必须等于 warp 的活动掩码。以下代码模式在 Volta 上有效,但在 Pascal 或更早的架构上无效。

if (tid % warpSize < 16) {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
} else {
    ...
    float swapped = __shfl_xor_sync(0xffffffff, val, 16);
    ...
}

The replacement for __ballot(1) is __activemask(). Note that threads within a warp can diverge even within a single code path. As a result, __activemask() and __ballot(1) may return only a subset of the threads on the current code path. The following invalid code example sets bit i of output to 1 when data[i] is greater than threshold. __activemask() is used in an attempt to enable cases where dataLen is not a multiple of 32.
替换 __ballot(1) 的是 __activemask() 。请注意,即使在单个代码路径中,warp 内的线程也可能分歧。因此, __activemask()__ballot(1) 可能仅返回当前代码路径上的部分线程。以下无效的代码示例在 data[i] 大于 threshold 时将 output 的第 i 位设置为 1。 __activemask() 用于尝试启用 dataLen 不是 32 的倍数的情况。

// Sets bit i of output[] to 1 if the corresponding element data[i]
// is greater than 'threshold', using 32 threads in a warp.

for (int i = warpLane; i < dataLen; i += warpSize) {
    unsigned active = __activemask();
    unsigned bitPack = __ballot_sync(active, data[i] > threshold);
    if (warpLane == 0) {
        output[i / 32] = bitPack;
    }
}

This code is invalid because CUDA does not guarantee that the warp will diverge ONLY at the loop condition. When divergence happens for other reasons, conflicting results will be computed for the same 32-bit output element by different subsets of threads in the warp. A correct code might use a non-divergent loop condition together with __ballot_sync() to safely enumerate the set of threads in the warp participating in the threshold calculation as follows.
此代码无效,因为 CUDA 不能保证 warp 仅在循环条件处分歧。当出现其他原因导致分歧时,warp 中不同子线程集合将为相同的 32 位输出元素计算冲突的结果。正确的代码可能会使用非分歧的循环条件以及 __ballot_sync() 来安全地枚举参与阈值计算的 warp 中线程集合。

for (int i = warpLane; i - warpLane < dataLen; i += warpSize) {
    unsigned active = __ballot_sync(0xFFFFFFFF, i < dataLen);
    if (i < dataLen) {
        unsigned bitPack = __ballot_sync(active, data[i] > threshold);
        if (warpLane == 0) {
            output[i / 32] = bitPack;
        }
    }
}

Discovery Pattern demonstrates a valid use case for __activemask().
发现模式展示了 __activemask() 的有效用例。

  2. If applications have warp-synchronous code, they will need to insert the new __syncwarp() warp-wide barrier synchronization instruction between any steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
    如果应用程序包含 warp 同步代码,则需要在任何通过全局或共享内存在线程间交换数据的步骤之间插入新的 __syncwarp() warp 级屏障同步指令。假设代码以锁步方式执行,或者假设不同线程的读/写在没有同步的情况下对整个 warp 可见,都是无效的。

    __shared__ float s_buff[BLOCK_SIZE];
    s_buff[tid] = val;
    __syncthreads();
    
    // Inter-warp reduction
    for (int i = BLOCK_SIZE / 2; i >= 32; i /= 2) {
        if (tid < i) {
            s_buff[tid] += s_buff[tid+i];
        }
        __syncthreads();
    }
    
    // Intra-warp reduction
    // Butterfly reduction simplifies syncwarp mask
    if (tid < 32) {
        float temp;
        temp = s_buff[tid ^ 16]; __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 8];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 4];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
        temp = s_buff[tid ^ 2];  __syncwarp();
        s_buff[tid] += temp;     __syncwarp();
    }
    
    if (tid == 0) {
        *output = s_buff[0] + s_buff[1];
    }
    __syncthreads();
    
  3. Although __syncthreads() has been consistently documented as synchronizing all threads in the thread block, Pascal and prior architectures could only enforce synchronization at the warp level. In certain cases, this allowed a barrier to succeed without being executed by every thread as long as at least some thread in every warp reached the barrier. Starting with Volta, the CUDA built-in __syncthreads() and PTX instruction bar.sync (and their derivatives) are enforced per thread and thus will not succeed until reached by all non-exited threads in the block. Code exploiting the previous behavior will likely deadlock and must be modified to ensure that all non-exited threads reach the barrier.
    尽管 __syncthreads() 一直被记录为同步线程块中的所有线程,但 Pascal 和之前的架构只能在 warp 级别强制执行同步。在某些情况下,只要每个 warp 中至少有一些线程到达屏障,就可以使屏障成功而不被每个线程执行。从 Volta 开始,CUDA 内置的 __syncthreads() 和 PTX 指令 bar.sync (及其派生指令)被强制执行到每个线程,因此直到块中所有未退出的线程到达为止才会成功。利用先前行为的代码可能会发生死锁,必须修改以确保所有未退出的线程都到达屏障。

The racecheck and synccheck tools provided by compute-sanitizer can help with locating violations.
compute-sanitizer 提供的 racecheck 和 synccheck 工具可以帮助定位违规行为。

To aid migration while implementing the above-mentioned corrective actions, developers can opt-in to the Pascal scheduling model that does not support independent thread scheduling. See Application Compatibility for details.
为了在实施上述纠正措施的同时帮助迁移,开发人员可以选择使用不支持独立线程调度的 Pascal 调度模型。有关详细信息,请参阅应用程序兼容性。

16.6.3. Global Memory
16.6.3. 全局内存 

Global memory behaves the same way as in devices of compute capability 5.x (See Global Memory).
全局内存的行为方式与计算能力为 5.x 的设备相同(请参阅全局内存)。

16.6.4. Shared Memory
16.6.4. 共享内存 

The amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the Volta architecture (compute capability 7.0), the unified data cache has a size of 128 KB, and the shared memory capacity can be set to 0, 8, 16, 32, 64 or 96 KB. For the Turing architecture (compute capability 7.5), the unified data cache has a size of 96 KB, and the shared memory capacity can be set to either 32 KB or 64 KB. Unlike Kepler, the driver automatically configures the shared memory capacity for each kernel to avoid shared memory occupancy bottlenecks while also allowing concurrent execution with already launched kernels where possible. In most cases, the driver’s default behavior should provide optimal performance.
统一数据缓存为共享内存保留的数量可以根据每个内核进行配置。对于 Volta 架构(计算能力 7.0),统一数据缓存的大小为 128 KB,共享内存容量可以设置为 0、8、16、32、64 或 96 KB。对于 Turing 架构(计算能力 7.5),统一数据缓存的大小为 96 KB,共享内存容量可以设置为 32 KB 或 64 KB。与 Kepler 不同,驱动程序会自动为每个内核配置共享内存容量,以避免共享内存占用瓶颈,同时在可能的情况下允许与已启动内核并发执行。在大多数情况下,驱动程序的默认行为应该提供最佳性能。

Because the driver is not always aware of the full workload, it is sometimes useful for applications to provide additional hints regarding the desired shared memory configuration. For example, a kernel with little or no shared memory use may request a larger carveout in order to encourage concurrent execution with later kernels that require more shared memory. The new cudaFuncSetAttribute() API allows applications to set a preferred shared memory capacity, or carveout, as a percentage of the maximum supported shared memory capacity (96 KB for Volta, and 64 KB for Turing).
由于驱动程序并不总是了解完整的工作负载,因此应用程序有时会提供有关所需共享内存配置的额外提示是很有用的。例如,几乎没有共享内存使用的内核可能会请求更大的切割以鼓励与后续需要更多共享内存的内核并发执行。新的 cudaFuncSetAttribute() API 允许应用程序设置首选共享内存容量,或 carveout ,作为最大支持的共享内存容量的百分比(Volta 为 96 KB,Turing 为 64 KB)。

cudaFuncSetAttribute() relaxes enforcement of the preferred shared capacity compared to the legacy cudaFuncSetCacheConfig() API introduced with Kepler. The legacy API treated shared memory capacities as hard requirements for kernel launch. As a result, interleaving kernels with different shared memory configurations would needlessly serialize launches behind shared memory reconfigurations. With the new API, the carveout is treated as a hint. The driver may choose a different configuration if required to execute the function or to avoid thrashing.
cudaFuncSetAttribute() 相对于 Kepler 引入的传统 cudaFuncSetCacheConfig() API 放宽了对首选共享容量的执行。传统 API 将共享内存容量视为内核启动的硬性要求。因此,交错使用具有不同共享内存配置的内核将不必要地使共享内存重新配置背后的启动序列化。使用新 API,切割被视为提示。如果需要执行函数或避免抖动,驱动程序可能会选择不同的配置。

// Device code
__global__ void MyKernel(...)
{
    __shared__ float buffer[BLOCK_DIM];
    ...
}

// Host code
int carveout = 50; // prefer shared memory capacity 50% of maximum
// Named Carveout Values:
// carveout = cudaSharedmemCarveoutDefault;   //  (-1)
// carveout = cudaSharedmemCarveoutMaxL1;     //   (0)
// carveout = cudaSharedmemCarveoutMaxShared; // (100)
cudaFuncSetAttribute(MyKernel, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);
MyKernel <<<gridDim, BLOCK_DIM>>>(...);

In addition to an integer percentage, several convenience enums are provided as listed in the code comments above. Where a chosen integer percentage does not map exactly to a supported capacity (SM 7.0 devices support shared capacities of 0, 8, 16, 32, 64, or 96 KB), the next larger capacity is used. For instance, in the example above, 50% of the 96 KB maximum is 48 KB, which is not a supported shared memory capacity. Thus, the preference is rounded up to 64 KB.
除了整数百分比外,代码注释中列出了几个方便的枚举。如果所选整数百分比与支持的容量不完全匹配(SM 7.0 设备支持共享容量为 0、8、16、32、64 或 96 KB),则使用下一个更大的容量。例如,在上面的示例中,96 KB 最大容量的 50% 为 48 KB,这不是受支持的共享内存容量。因此,首选项将向上舍入为 64 KB。

Compute capability 7.x devices allow a single thread block to address the full capacity of shared memory: 96 KB on Volta, 64 KB on Turing. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, as such they must use dynamic shared memory (rather than statically sized arrays) and require an explicit opt-in using cudaFuncSetAttribute() as follows.
计算能力为 7.x 的设备允许单个线程块访问共享内存的全部容量:Volta 上为 96 KB,Turing 上为 64 KB。依赖于每个块超过 48 KB 的共享内存分配的内核是特定于架构的,因此它们必须使用动态共享内存(而不是静态大小的数组),并需要使用 cudaFuncSetAttribute() 明确选择。

// Device code
__global__ void MyKernel(...)
{
    extern __shared__ float buffer[];
    ...
}

// Host code
int maxbytes = 98304; // 96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, maxbytes);
MyKernel <<<gridDim, blockDim, maxbytes>>>(...);

Otherwise, shared memory behaves the same way as for devices of compute capability 5.x (See Shared Memory).

16.7. Compute Capability 8.x

16.7.1. Architecture

A Streaming Multiprocessor (SM) consists of:

  • 64 FP32 cores for single-precision arithmetic operations in devices of compute capability 8.0 and 128 FP32 cores in devices of compute capability 8.6, 8.7 and 8.9,

  • 32 FP64 cores for double-precision arithmetic operations in devices of compute capability 8.0 and 2 FP64 cores in devices of compute capability 8.6, 8.7 and 8.9,

  • 64 INT32 cores for integer math,

  • 4 mixed-precision Third-Generation Tensor Cores supporting half-precision (fp16), __nv_bfloat16, tf32, sub-byte and double precision (fp64) matrix arithmetic for compute capabilities 8.0, 8.6 and 8.7 (see Warp matrix functions for details),

  • 4 mixed-precision Fourth-Generation Tensor Cores supporting fp8, fp16, __nv_bfloat16, tf32, sub-byte and fp64 for compute capability 8.9 (see Warp matrix functions for details),

  • 16 special function units for single-precision floating-point transcendental functions,

  • 4 warp schedulers.

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

An SM has:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,

  • a unified data cache and shared memory with a total size of 192 KB for devices of compute capability 8.0 and 8.7 (1.5x Volta’s 128 KB capacity) and 128 KB for devices of compute capabilities 8.6 and 8.9.

Shared memory is partitioned out of the unified data cache, and can be configured to various sizes (see Shared Memory section). The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.

16.7.2. Global Memory

Global memory behaves the same way as for devices of compute capability 5.x (See Global Memory).

16.7.3. Shared Memory

Similar to the Volta architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the NVIDIA Ampere GPU architecture, the unified data cache has a size of 192 KB for devices of compute capability 8.0 and 8.7 and 128 KB for devices of compute capabilities 8.6 and 8.9. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132 or 164 KB for devices of compute capability 8.0 and 8.7, and to 0, 8, 16, 32, 64 or 100 KB for devices of compute capabilities 8.6 and 8.9.

An application can set the carveout, i.e., the preferred shared memory capacity, with the cudaFuncSetAttribute().

cudaFuncSetAttribute(kernel_name, cudaFuncAttributePreferredSharedMemoryCarveout, carveout);

The API can specify the carveout either as an integer percentage of the maximum supported shared memory capacity of 164 KB for devices of compute capability 8.0 and 8.7 and 100 KB for devices of compute capabilities 8.6 and 8.9 respectively, or as one of the following values: cudaSharedmemCarveoutDefault, cudaSharedmemCarveoutMaxL1, or cudaSharedmemCarveoutMaxShared. When using a percentage, the carveout is rounded up to the nearest supported shared memory capacity. For example, for devices of compute capability 8.0, 50% will map to a 100 KB carveout instead of an 82 KB one. Setting the cudaFuncAttributePreferredSharedMemoryCarveout is considered a hint by the driver; the driver may choose a different configuration, if needed.
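
The same round-up rule applies here with the larger capacity list. A quick sketch under the compute capability 8.0/8.7 numbers quoted above (the function name is illustrative):

```cpp
#include <array>
#include <cassert>

// Supported carveout capacities in KB for compute capability 8.0 and 8.7.
constexpr std::array<int, 8> kAmpereCapsKB = {0, 8, 16, 32, 64, 100, 132, 164};

// Round a requested percentage of the 164 KB maximum up to the
// nearest supported capacity.
inline int ampereCarveoutKB(int percent) {
    const int requestedKB = percent * 164 / 100;  // 50% -> 82 KB
    for (int cap : kAmpereCapsKB)
        if (cap >= requestedKB)
            return cap;                           // 82 KB rounds up to 100 KB
    return 164;
}
```

`ampereCarveoutKB(50)` reproduces the 100 KB mapping described above.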

Devices of compute capability 8.0 and 8.7 allow a single thread block to address up to 163 KB of shared memory, while devices of compute capabilities 8.6 and 8.9 allow up to 99 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.

16.8. Compute Capability 9.0

16.8.1. Architecture

A Streaming Multiprocessor (SM) consists of:

  • 128 FP32 cores for single-precision arithmetic operations,

  • 64 FP64 cores for double-precision arithmetic operations,

  • 64 INT32 cores for integer math,

  • 4 mixed-precision fourth-generation Tensor Cores supporting the new FP8 input type in either E4M3 or E5M2 for exponent (E) and mantissa (M), half-precision (fp16), __nv_bfloat16, tf32, INT8 and double precision (fp64) matrix arithmetic (see Warp Matrix Functions for details) with sparsity support,

  • 16 special function units for single-precision floating-point transcendental functions,

  • 4 warp schedulers.

An SM statically distributes its warps among its schedulers. Then, at every instruction issue time, each scheduler issues one instruction for one of its assigned warps that is ready to execute, if any.

An SM has:

  • a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory,

  • a unified data cache and shared memory with a total size of 256 KB for devices of compute capability 9.0 (1.33x NVIDIA Ampere GPU Architecture’s 192 KB capacity).

Shared memory is partitioned out of the unified data cache, and can be configured to various sizes (see Shared Memory section). The remaining data cache serves as an L1 cache and is also used by the texture unit that implements the various addressing and data filtering modes mentioned in Texture and Surface Memory.

16.8.2. Global Memory

Global memory behaves the same way as for devices of compute capability 5.x (See Global Memory).

16.8.3. Shared Memory

Similar to the NVIDIA Ampere GPU architecture, the amount of the unified data cache reserved for shared memory is configurable on a per kernel basis. For the NVIDIA H100 Tensor Core GPU architecture, the unified data cache has a size of 256 KB for devices of compute capability 9.0. The shared memory capacity can be set to 0, 8, 16, 32, 64, 100, 132, 164, 196 or 228 KB.

As with the NVIDIA Ampere GPU architecture, an application can configure its preferred shared memory capacity, i.e., the carveout. Devices of compute capability 9.0 allow a single thread block to address up to 227 KB of shared memory. Kernels relying on shared memory allocations over 48 KB per block are architecture-specific, and must use dynamic shared memory rather than statically sized shared memory arrays. These kernels require an explicit opt-in by using cudaFuncSetAttribute() to set the cudaFuncAttributeMaxDynamicSharedMemorySize; see Shared Memory for the Volta architecture.

Note that the maximum amount of shared memory per thread block is smaller than the maximum shared memory partition available per SM. The 1 KB of shared memory not made available to a thread block is reserved for system use.

32

above 48 KB requires dynamic shared memory

33

2 FP64 cores for double-precision arithmetic operations for devices of compute capabilities 7.5

34

2 FP64 cores for double-precision arithmetic operations for devices of compute capabilities 7.5

16.8.4. Features Accelerating Specialized Computations

The NVIDIA Hopper GPU architecture includes features to accelerate matrix multiply-accumulate (MMA) computations with:

  • asynchronous execution of MMA instructions

  • MMA instructions acting on large matrices spanning a warp-group

  • dynamic reassignment of register capacity among warp-groups to support even larger matrices, and

  • operand matrices accessed directly from shared memory

This feature set is only available within the CUDA compilation toolchain through inline PTX.

It is strongly recommended that applications utilize this complex feature set through CUDA-X libraries such as cuBLAS, cuDNN, or cuFFT.

It is strongly recommended that device kernels utilize this complex feature set through CUTLASS, a collection of CUDA C++ template abstractions for implementing high-performance matrix-multiplication (GEMM) and related computations at all levels and scales within CUDA.

17. Driver API

This section assumes knowledge of the concepts described in CUDA Runtime.

The driver API is implemented in the cuda dynamic library (cuda.dll or cuda.so) which is copied on the system during the installation of the device driver. All its entry points are prefixed with cu.

It is a handle-based, imperative API: Most objects are referenced by opaque handles that may be specified to functions to manipulate the objects.

The objects available in the driver API are summarized in Table 22.

Table 22 Objects Available in the CUDA Driver API

Object            | Handle      | Description
Device            | CUdevice    | CUDA-enabled device
Context           | CUcontext   | Roughly equivalent to a CPU process
Module            | CUmodule    | Roughly equivalent to a dynamic library
Function          | CUfunction  | Kernel
Heap memory       | CUdeviceptr | Pointer to device memory
CUDA array        | CUarray     | Opaque container for one-dimensional or two-dimensional data on the device, readable via texture or surface references
Texture reference | CUtexref    | Object that describes how to interpret texture memory data
Surface reference | CUsurfref   | Object that describes how to read or write CUDA arrays
Stream            | CUstream    | Object that describes a CUDA stream
Event             | CUevent     | Object that describes a CUDA event

The driver API must be initialized with cuInit() before any function from the driver API is called. A CUDA context must then be created that is attached to a specific device and made current to the calling host thread as detailed in Context.

Within a CUDA context, kernels are explicitly loaded as PTX or binary objects by the host code as described in Module. Kernels written in C++ must therefore be compiled separately into PTX or binary objects. Kernels are launched using API entry points as described in Kernel Execution.

Any application that wants to run on future device architectures must load PTX, not binary code. This is because binary code is architecture-specific and therefore incompatible with future architectures, whereas PTX code is compiled to binary code at load time by the device driver.

Here is the host code of the sample from Kernels written using the driver API:

int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Initialize
    cuInit(0);

    // Get number of devices supporting CUDA
    int deviceCount = 0;
    cuDeviceGetCount(&deviceCount);
    if (deviceCount == 0) {
        printf("There is no device supporting CUDA.\n");
        exit (0);
    }

    // Get handle for device 0
    CUdevice cuDevice;
    cuDeviceGet(&cuDevice, 0);

    // Create context
    CUcontext cuContext;
    cuCtxCreate(&cuContext, 0, cuDevice);

    // Create module from binary file
    CUmodule cuModule;
    cuModuleLoad(&cuModule, "VecAdd.ptx");

    // Allocate vectors in device memory
    CUdeviceptr d_A;
    cuMemAlloc(&d_A, size);
    CUdeviceptr d_B;
    cuMemAlloc(&d_B, size);
    CUdeviceptr d_C;
    cuMemAlloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cuMemcpyHtoD(d_A, h_A, size);
    cuMemcpyHtoD(d_B, h_B, size);

    // Get function handle from module
    CUfunction vecAdd;
    cuModuleGetFunction(&vecAdd, cuModule, "VecAdd");

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    void* args[] = { &d_A, &d_B, &d_C, &N };
    cuLaunchKernel(vecAdd,
                   blocksPerGrid, 1, 1, threadsPerBlock, 1, 1,
                   0, 0, args, 0);

    ...
}

Full code can be found in the vectorAddDrv CUDA sample.

17.1. Context

A CUDA context is analogous to a CPU process. All resources and actions performed within the driver API are encapsulated inside a CUDA context, and the system automatically cleans up these resources when the context is destroyed. Besides objects such as modules and texture or surface references, each context has its own distinct address space. As a result, CUdeviceptr values from different contexts reference different memory locations.

A host thread may have only one device context current at a time. When a context is created with cuCtxCreate(), it is made current to the calling host thread. CUDA functions that operate in a context (most functions that do not involve device enumeration or context management) will return CUDA_ERROR_INVALID_CONTEXT if a valid context is not current to the thread.

Each host thread has a stack of current contexts. cuCtxCreate() pushes the new context onto the top of the stack. cuCtxPopCurrent() may be called to detach the context from the host thread. The context is then “floating” and may be pushed as the current context for any host thread. cuCtxPopCurrent() also restores the previous current context, if any.
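
The stack discipline described above can be modeled with a small toy sketch; the struct and method names below are illustrative stand-ins, not the CUDA API:

```cpp
#include <cassert>
#include <stack>
#include <string>

// Toy model of one host thread's stack of current contexts.
struct ThreadCtxStack {
    std::stack<std::string> stack;

    // cuCtxCreate(): the new context is pushed and becomes current.
    void create(const std::string& ctx) { stack.push(ctx); }

    // cuCtxPushCurrent(): a floating context is pushed and becomes current.
    void push(const std::string& ctx) { stack.push(ctx); }

    // cuCtxPopCurrent(): the top context is detached ("floating") and the
    // previous current context, if any, becomes current again.
    std::string pop() {
        std::string top = stack.top();
        stack.pop();
        return top;
    }

    bool hasCurrent() const { return !stack.empty(); }
    const std::string& current() const { return stack.top(); }
};
```

After `create("A")` followed by `create("B")`, popping detaches B and makes A current again, mirroring how cuCtxPopCurrent() restores the previous current context.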

A usage count is also maintained for each context. cuCtxCreate() creates a context with a usage count of 1. cuCtxAttach() increments the usage count and cuCtxDetach() decrements it. A context is destroyed when the usage count goes to 0 when calling cuCtxDetach() or cuCtxDestroy().

The driver API is interoperable with the runtime and it is possible to access the primary context (see Initialization) managed by the runtime from the driver API via cuDevicePrimaryCtxRetain().

Usage count facilitates interoperability between third party authored code operating in the same context. For example, if three libraries are loaded to use the same context, each library would call cuCtxAttach() to increment the usage count and cuCtxDetach() to decrement the usage count when the library is done using the context. For most libraries, it is expected that the application will have created a context before loading or initializing the library; that way, the application can create the context using its own heuristics, and the library simply operates on the context handed to it. Libraries that wish to create their own contexts - unbeknownst to their API clients who may or may not have created contexts of their own - would use cuCtxPushCurrent() and cuCtxPopCurrent() as illustrated in the following figure.

Figure 37 Library Context Management

17.2. Module

Modules are dynamically loadable packages of device code and data, akin to DLLs in Windows, that are output by nvcc (see Compilation with NVCC). The names for all symbols, including functions, global variables, and texture or surface references, are maintained at module scope so that modules written by independent third parties may interoperate in the same CUDA context.

This code sample loads a module and retrieves a handle to some kernel:

CUmodule cuModule;
cuModuleLoad(&cuModule, "myModule.ptx");
CUfunction myKernel;
cuModuleGetFunction(&myKernel, cuModule, "MyKernel");

This code sample compiles and loads a new module from PTX code and parses compilation errors:

#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[3];
void* values[3];
char* PTXCode = "some PTX code";
char error_log[BUFFER_SIZE];
int err;
options[0] = CU_JIT_ERROR_LOG_BUFFER;
values[0]  = (void*)error_log;
options[1] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1]  = (void*)BUFFER_SIZE;
options[2] = CU_JIT_TARGET_FROM_CUCONTEXT;
values[2]  = 0;
err = cuModuleLoadDataEx(&cuModule, PTXCode, 3, options, values);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);

This code sample compiles, links, and loads a new module from multiple PTX codes and parses link and compilation errors:

#define BUFFER_SIZE 8192
CUmodule cuModule;
CUjit_option options[6];
void* values[6];
float walltime;
char error_log[BUFFER_SIZE], info_log[BUFFER_SIZE];
char* PTXCode0 = "some PTX code";
char* PTXCode1 = "some other PTX code";
CUlinkState linkState;
int err;
void* cubin;
size_t cubinSize;
options[0] = CU_JIT_WALL_TIME;
values[0] = (void*)&walltime;
options[1] = CU_JIT_INFO_LOG_BUFFER;
values[1] = (void*)info_log;
options[2] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
values[2] = (void*)BUFFER_SIZE;
options[3] = CU_JIT_ERROR_LOG_BUFFER;
values[3] = (void*)error_log;
options[4] = CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES;
values[4] = (void*)BUFFER_SIZE;
options[5] = CU_JIT_LOG_VERBOSE;
values[5] = (void*)1;
cuLinkCreate(6, options, values, &linkState);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode0, strlen(PTXCode0) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
err = cuLinkAddData(linkState, CU_JIT_INPUT_PTX,
                    (void*)PTXCode1, strlen(PTXCode1) + 1, 0, 0, 0, 0);
if (err != CUDA_SUCCESS)
    printf("Link error:\n%s\n", error_log);
cuLinkComplete(linkState, &cubin, &cubinSize);
printf("Link completed in %fms. Linker Output:\n%s\n", walltime, info_log);
cuModuleLoadData(&cuModule, cubin);
cuLinkDestroy(linkState);

Full code can be found in the ptxjit CUDA sample.

17.3. Kernel Execution

cuLaunchKernel() launches a kernel with a given execution configuration.

Parameters are passed either as an array of pointers (next to last parameter of cuLaunchKernel()) where the nth pointer corresponds to the nth parameter and points to a region of memory from which the parameter is copied, or as one of the extra options (last parameter of cuLaunchKernel()).

When parameters are passed as an extra option (the CU_LAUNCH_PARAM_BUFFER_POINTER option), they are passed as a pointer to a single buffer where parameters are assumed to be properly offset with respect to each other by matching the alignment requirement for each parameter type in device code.

Alignment requirements in device code for the built-in vector types are listed in Table 5. For all other basic types, the alignment requirement in device code matches the alignment requirement in host code and can therefore be obtained using __alignof(). The only exception is when the host compiler aligns double and long long (and long on a 64-bit system) on a one-word boundary instead of a two-word boundary (for example, using gcc’s compilation flag -mno-align-double) since in device code these types are always aligned on a two-word boundary.

CUdeviceptr is an integer, but represents a pointer, so its alignment requirement is __alignof(void*).

The following code sample uses a macro (ALIGN_UP()) to adjust the offset of each parameter to meet its alignment requirement and another macro (ADD_TO_PARAM_BUFFER()) to add each parameter to the parameter buffer passed to the CU_LAUNCH_PARAM_BUFFER_POINTER option.

#define ALIGN_UP(offset, alignment) \
      (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

char paramBuffer[1024];
size_t paramBufferSize = 0;

#define ADD_TO_PARAM_BUFFER(value, alignment)                   \
    do {                                                        \
        paramBufferSize = ALIGN_UP(paramBufferSize, alignment); \
        memcpy(paramBuffer + paramBufferSize,                   \
               &(value), sizeof(value));                        \
        paramBufferSize += sizeof(value);                       \
    } while (0)

int i;
ADD_TO_PARAM_BUFFER(i, __alignof(i));
float4 f4;
ADD_TO_PARAM_BUFFER(f4, 16); // float4's alignment is 16
char c;
ADD_TO_PARAM_BUFFER(c, __alignof(c));
float f;
ADD_TO_PARAM_BUFFER(f, __alignof(f));
CUdeviceptr devPtr;
ADD_TO_PARAM_BUFFER(devPtr, __alignof(devPtr));
float2 f2;
ADD_TO_PARAM_BUFFER(f2, 8); // float2's alignment is 8

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, paramBuffer,
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &paramBufferSize,
    CU_LAUNCH_PARAM_END
};
cuLaunchKernel(cuFunction,
               gridWidth, gridHeight, gridDepth,
               blockWidth, blockHeight, blockDepth,
               0, 0, 0, extra);
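
As a sanity check, the offsets the macros produce for the parameter sequence above can be computed on the host. This sketch assumes a 64-bit platform where __alignof(void*) is 8, so the CUdeviceptr parameter is 8-byte aligned:

```cpp
#include <cassert>
#include <cstddef>

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

// Walk through the parameter sequence from the example above and
// accumulate the offset at which each parameter lands in the buffer.
inline size_t finalParamBufferSize() {
    size_t off = 0;
    ALIGN_UP(off, 4);  off += sizeof(int);    // int i        at offset 0
    ALIGN_UP(off, 16); off += 16;             // float4 f4    at offset 16
    ALIGN_UP(off, 1);  off += sizeof(char);   // char c       at offset 32
    ALIGN_UP(off, 4);  off += sizeof(float);  // float f      at offset 36
    ALIGN_UP(off, 8);  off += 8;              // CUdeviceptr  at offset 40
    ALIGN_UP(off, 8);  off += 8;              // float2 f2    at offset 48
    return off;                               // total: 56 bytes
}
```

Note how aligning the float4 parameter jumps the offset from 4 to 16, which is exactly the padding behavior the ALIGN_UP() macro exists to produce.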

The alignment requirement of a structure is equal to the maximum of the alignment requirements of its fields. The alignment requirement of a structure that contains built-in vector types, CUdeviceptr, or non-aligned double and long long, might therefore differ between device code and host code. Such a structure might also be padded differently. The following structure, for example, is not padded at all in host code, but it is padded in device code with 12 bytes after field f since the alignment requirement for field f4 is 16.

typedef struct {
    float  f;
    float4 f4;
} myStruct;
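
The differing layouts can be checked with host-side stand-ins; Float4 below models the 16-byte alignment that float4 has in device code, and all type names here are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for float4 with device code's 16-byte alignment requirement.
struct alignas(16) Float4 { float x, y, z, w; };

// Device-code layout: 12 bytes of padding are inserted after f so that
// f4 starts on a 16-byte boundary.
struct MyStructDevice {
    float  f;
    Float4 f4;
};

// Host-code layout when the host compiler imposes no such requirement:
// f4 follows f immediately, with no padding.
struct MyStructHost {
    float f;
    float f4[4];
};
```

Comparing `offsetof` and `sizeof` for the two stand-ins makes the 12-byte padding difference described above directly visible.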

17.4. Interoperability between Runtime and Driver APIs

An application can mix runtime API code with driver API code.

If a context is created and made current via the driver API, subsequent runtime calls will pick up this context instead of creating a new one.

If the runtime is initialized (implicitly as mentioned in CUDA Runtime), cuCtxGetCurrent() can be used to retrieve the context created during initialization. This context can be used by subsequent driver API calls.

The implicitly created context from the runtime is called the primary context (see Initialization). It can be managed from the driver API with the Primary Context Management functions.

Device memory can be allocated and freed using either API. CUdeviceptr can be cast to regular pointers and vice-versa:

CUdeviceptr devPtr;
float* d_data;

// Allocation using driver API
cuMemAlloc(&devPtr, size);
d_data = (float*)devPtr;

// Allocation using runtime API
cudaMalloc(&d_data, size);
devPtr = (CUdeviceptr)d_data;

In particular, this means that applications written using the driver API can invoke libraries written using the runtime API (such as cuFFT, cuBLAS, …).

All functions from the device and version management sections of the reference manual can be used interchangeably.

17.5. Driver Entry Point Access

17.5.1. Introduction

The Driver Entry Point Access APIs provide a way to retrieve the address of a CUDA driver function. Starting from CUDA 11.3, users can call into available CUDA driver APIs using function pointers obtained from these APIs.

These APIs provide functionality similar to their counterparts, dlsym on POSIX platforms and GetProcAddress on Windows. The provided APIs will let users:

  • Retrieve the address of a driver function using the CUDA Driver API.

  • Retrieve the address of a driver function using the CUDA Runtime API.

  • Request the per-thread default stream version of a CUDA driver function. For more details, see Retrieve per-thread default stream versions.

  • Access new CUDA features on older toolkits but with a newer driver.

17.5.2. Driver Function Typedefs

To help retrieve the CUDA Driver API entry points, the CUDA Toolkit provides access to headers containing the function pointer definitions for all CUDA driver APIs. These headers are installed with the CUDA Toolkit and are made available in the toolkit’s include/ directory. The table below summarizes the header files containing the typedefs for each CUDA API header file.

Table 23 Typedefs header files for CUDA driver APIs

API header file | API Typedef header file
cuda.h          | cudaTypedefs.h
cudaGL.h        | cudaGLTypedefs.h
cudaProfiler.h  | cudaProfilerTypedefs.h
cudaVDPAU.h     | cudaVDPAUTypedefs.h
cudaEGL.h       | cudaEGLTypedefs.h
cudaD3D9.h      | cudaD3D9Typedefs.h
cudaD3D10.h     | cudaD3D10Typedefs.h
cudaD3D11.h     | cudaD3D11Typedefs.h

The above headers do not define actual function pointers themselves; they define the typedefs for function pointers. For example, cudaTypedefs.h has the below typedefs for the driver API cuMemAlloc:

typedef CUresult (CUDAAPI *PFN_cuMemAlloc_v3020)(CUdeviceptr_v2 *dptr, size_t bytesize);
typedef CUresult (CUDAAPI *PFN_cuMemAlloc_v2000)(CUdeviceptr_v1 *dptr, unsigned int bytesize);

CUDA driver symbols use a version-based naming scheme with a _v* suffix in their names, except for the first version. When the signature or the semantics of a specific CUDA driver API change, the version number of the corresponding driver symbol is incremented. For the cuMemAlloc driver API, the first driver symbol name is cuMemAlloc and the next symbol name is cuMemAlloc_v2. The typedef for the first version, introduced in CUDA 2.0 (2000), is PFN_cuMemAlloc_v2000. The typedef for the next version, introduced in CUDA 3.2 (3020), is PFN_cuMemAlloc_v3020.

The typedefs can be used to more easily define a function pointer of the appropriate type in code:

PFN_cuMemAlloc_v3020 pfn_cuMemAlloc_v2;
PFN_cuMemAlloc_v2000 pfn_cuMemAlloc_v1;

The above method is preferable if users are interested in a specific version of the API. Additionally, the headers have predefined macros for the latest version of all driver symbols that were available when the installed CUDA toolkit was released; these typedefs do not have a _v* suffix. For the CUDA 11.3 toolkit, cuMemAlloc_v2 was the latest version, so we can also define its function pointer as below:

PFN_cuMemAlloc pfn_cuMemAlloc;

17.5.3. Driver Function Retrieval

Using the Driver Entry Point Access APIs and the appropriate typedef, we can get the function pointer to any CUDA driver API.

17.5.3.1. Using the driver API

The driver API requires the CUDA version as an argument to get the ABI-compatible version of the requested driver symbol. CUDA Driver APIs have a per-function ABI denoted with a _v* extension. For example, consider the versions of cuStreamBeginCapture and their corresponding typedefs from cudaTypedefs.h:

// cuda.h
CUresult CUDAAPI cuStreamBeginCapture(CUstream hStream);
CUresult CUDAAPI cuStreamBeginCapture_v2(CUstream hStream, CUstreamCaptureMode mode);

// cudaTypedefs.h
typedef CUresult (CUDAAPI *PFN_cuStreamBeginCapture_v10000)(CUstream hStream);
typedef CUresult (CUDAAPI *PFN_cuStreamBeginCapture_v10010)(CUstream hStream, CUstreamCaptureMode mode);

In the typedefs above, the version suffixes _v10000 and _v10010 indicate that these APIs were introduced in CUDA 10.0 and CUDA 10.1, respectively.

#include <cudaTypedefs.h>

// Declare the entry points for cuStreamBeginCapture
PFN_cuStreamBeginCapture_v10000 pfn_cuStreamBeginCapture_v1;
PFN_cuStreamBeginCapture_v10010 pfn_cuStreamBeginCapture_v2;

// Get the function pointer to the cuStreamBeginCapture driver symbol
cuGetProcAddress("cuStreamBeginCapture", &pfn_cuStreamBeginCapture_v1, 10000, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
// Get the function pointer to the cuStreamBeginCapture_v2 driver symbol
cuGetProcAddress("cuStreamBeginCapture", &pfn_cuStreamBeginCapture_v2, 10010, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);

Referring to the code snippet above, to retrieve the address of the _v1 version of the driver API cuStreamBeginCapture, the CUDA version argument should be exactly 10.0 (10000). Similarly, the CUDA version for retrieving the address of the _v2 version of the API should be 10.1 (10010). Specifying a higher CUDA version for retrieving a specific version of a driver API might not always be portable. For example, using 11030 here would still return the _v2 symbol, but if a hypothetical _v3 version were released in CUDA 11.3, the cuGetProcAddress API would start returning the newer _v3 symbol instead when paired with a CUDA 11.3 driver. Since the ABI and function signatures of the _v2 and _v3 symbols might differ, calling the _v3 function through the _v10010 typedef intended for the _v2 symbol would exhibit undefined behavior.

To retrieve the latest version of a driver API for a given CUDA Toolkit, we can also specify CUDA_VERSION as the version argument and use the unversioned typedef to define the function pointer. Since _v2 is the latest version of the driver API cuStreamBeginCapture in CUDA 11.3, the below code snippet shows a different method to retrieve it.

// Assuming we are using CUDA 11.3 Toolkit

#include <cudaTypedefs.h>

// Declare the entry point
PFN_cuStreamBeginCapture pfn_cuStreamBeginCapture_latest;

// Initialize the entry point. Specifying CUDA_VERSION will give the function pointer to the
// cuStreamBeginCapture_v2 symbol since it is the latest version on CUDA 11.3.
cuGetProcAddress("cuStreamBeginCapture", &pfn_cuStreamBeginCapture_latest, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);

Note that requesting a driver API with an invalid CUDA version will return an error CUDA_ERROR_NOT_FOUND. In the above code examples, passing in a version less than 10000 (CUDA 10.0) would be invalid.

17.5.3.2. Using the runtime API

The runtime API cudaGetDriverEntryPoint uses the CUDA runtime version to get the ABI-compatible version of the requested driver symbol. In the code snippet below, the minimum required CUDA runtime version is CUDA 11.2, since that is when cuMemAllocAsync was introduced.

#include <cudaTypedefs.h>

// Declare the entry point
PFN_cuMemAllocAsync pfn_cuMemAllocAsync;

// Initialize the entry point. Assuming CUDA runtime version >= 11.2
cudaGetDriverEntryPoint("cuMemAllocAsync", &pfn_cuMemAllocAsync, cudaEnableDefault, &driverStatus);

// Call the entry point
if(driverStatus == cudaDriverEntryPointSuccess && pfn_cuMemAllocAsync) {
    pfn_cuMemAllocAsync(...);
}

The runtime API cudaGetDriverEntryPointByVersion uses a user-provided CUDA version to get the ABI-compatible version of the requested driver symbol. This allows more specific control over the requested ABI version.
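
As a sketch in the style of the earlier snippets (error handling elided; verify the exact signature against your toolkit's cuda_runtime.h), requesting the symbol that is ABI compatible with CUDA 10.1 through the runtime might look like:

```cpp
#include <cudaTypedefs.h>
#include <cuda_runtime.h>

// Declare the entry point using the CUDA 10.1 (_v2) typedef
PFN_cuStreamBeginCapture_v10010 pfn_cuStreamBeginCapture_v2;
enum cudaDriverEntryPointQueryResult driverStatus;

// Request the symbol that is ABI compatible with CUDA 10.1 (10010)
cudaError_t error = cudaGetDriverEntryPointByVersion("cuStreamBeginCapture",
    (void**)&pfn_cuStreamBeginCapture_v2, 10010, cudaEnableDefault, &driverStatus);
```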

17.5.3.3. Retrieve per-thread default stream versions

Some CUDA driver APIs can be configured to have default stream or per-thread default stream semantics. Driver APIs with per-thread default stream semantics are suffixed with _ptsz or _ptds in their names. For example, cuLaunchKernel has a per-thread default stream variant named cuLaunchKernel_ptsz. With the Driver Entry Point Access APIs, users can request the per-thread default stream version of the driver API cuLaunchKernel instead of the default stream version. Configuring the CUDA driver APIs for default stream or per-thread default stream semantics affects the synchronization behavior. More details can be found here.

The default stream or per-thread default stream versions of a driver API can be obtained by one of the following ways:

  • Use the compilation flag --default-stream per-thread or define the macro CUDA_API_PER_THREAD_DEFAULT_STREAM to get per-thread default stream behavior.

  • Force default stream or per-thread default stream behavior using the flags CU_GET_PROC_ADDRESS_LEGACY_STREAM/cudaEnableLegacyStream or CU_GET_PROC_ADDRESS_PER_THREAD_DEFAULT_STREAM/cudaEnablePerThreadDefaultStream respectively.
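
For example, the per-thread default stream variant of cuLaunchKernel could be requested explicitly with the second approach (a sketch in the style of the earlier snippets; error handling elided):

```cpp
#include <cudaTypedefs.h>

// The _ptsz variant shares its signature with cuLaunchKernel,
// so the regular typedef can be used
PFN_cuLaunchKernel pfn_cuLaunchKernel_ptsz;
CUdriverProcAddressQueryResult driverStatus;

// Force the per-thread default stream version; the driver resolves
// this request to cuLaunchKernel_ptsz instead of cuLaunchKernel
cuGetProcAddress("cuLaunchKernel", (void**)&pfn_cuLaunchKernel_ptsz, CUDA_VERSION,
                 CU_GET_PROC_ADDRESS_PER_THREAD_DEFAULT_STREAM, &driverStatus);
```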

17.5.3.4. Access new CUDA features

It is always recommended to install the latest CUDA toolkit to access new CUDA driver features, but if, for some reason, a user does not want to update or does not have access to the latest toolkit, the Driver Entry Point Access APIs can be used to access new CUDA features with only an updated CUDA driver. For discussion, let us assume the user is on CUDA 11.3 and wants to use the new driver API cuFoo available in the CUDA 12.0 driver. The below code snippet illustrates this use-case:

int main()
{
    // Assuming we have CUDA 12.0 driver installed.

    // Manually define the prototype as cudaTypedefs.h in CUDA 11.3 does not have the cuFoo typedef
    typedef CUresult (CUDAAPI *PFN_cuFoo)(...);
    PFN_cuFoo pfn_cuFoo = NULL;
    CUdriverProcAddressQueryResult driverStatus;

    // Get the address for cuFoo API using cuGetProcAddress. Specify CUDA version as
    // 12000 since cuFoo was introduced then or get the driver version dynamically
    // using cuDriverGetVersion
    int driverVersion;
    cuDriverGetVersion(&driverVersion);
    CUresult status = cuGetProcAddress("cuFoo", &pfn_cuFoo, driverVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);

    if (status == CUDA_SUCCESS && pfn_cuFoo) {
        pfn_cuFoo(...);
    }
    else {
        printf("Cannot retrieve the address to cuFoo - driverStatus = %d. Check if the latest driver for CUDA 12.0 is installed.\n", driverStatus);
        assert(0);
    }

    // rest of code here

}

17.5.4. Potential Implications with cuGetProcAddress

Below is a set of concrete and theoretical examples of potential issues with cuGetProcAddress and cudaGetDriverEntryPoint.

17.5.4.1. Implications with cuGetProcAddress vs Implicit Linking

cuDeviceGetUuid was introduced in CUDA 9.2. This API has a newer revision (cuDeviceGetUuid_v2) introduced in CUDA 11.4. To preserve minor version compatibility, cuDeviceGetUuid will not be version-bumped to cuDeviceGetUuid_v2 in cuda.h until CUDA 12.0. This means that calling it through a function pointer obtained via cuGetProcAddress might behave differently from calling it directly. Example using the API directly:

#include <cuda.h>

CUuuid uuid;
CUdevice dev;
CUresult status;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

status = cuDeviceGetUuid(&uuid, dev); // Get uuid of device 0

In this example, assume the user is compiling with CUDA 11.4. Note that this will perform the behavior of cuDeviceGetUuid, not the _v2 version. Now an example of using cuGetProcAddress:

#include <cudaTypedefs.h>

CUuuid uuid;
CUdevice dev;
CUresult status;
CUdriverProcAddressQueryResult driverStatus;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

PFN_cuDeviceGetUuid pfn_cuDeviceGetUuid;
status = cuGetProcAddress("cuDeviceGetUuid", &pfn_cuDeviceGetUuid, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if(CUDA_SUCCESS == status && pfn_cuDeviceGetUuid) {
    // pfn_cuDeviceGetUuid points to ???
}

In this example, assume the user is compiling with CUDA 11.4. This will get the function pointer of cuDeviceGetUuid_v2. Calling the function pointer will then invoke the new _v2 function, not the cuDeviceGetUuid shown in the previous example.

17.5.4.2. Compile Time vs Runtime Version Usage in cuGetProcAddress

Let’s take the same issue and make one small tweak. The last example used the compile-time constant CUDA_VERSION to determine which function pointer to obtain. More complications arise if the user queries the driver version dynamically, using cuDriverGetVersion or cudaDriverGetVersion, and passes it to cuGetProcAddress. Example:

#include <cudaTypedefs.h>

CUuuid uuid;
CUdevice dev;
CUresult status;
int cudaVersion;
CUdriverProcAddressQueryResult driverStatus;

status = cuDeviceGet(&dev, 0); // Get device 0
// handle status

status = cuDriverGetVersion(&cudaVersion);
// handle status

PFN_cuDeviceGetUuid pfn_cuDeviceGetUuid;
status = cuGetProcAddress("cuDeviceGetUuid", &pfn_cuDeviceGetUuid, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if(CUDA_SUCCESS == status && pfn_cuDeviceGetUuid) {
    // pfn_cuDeviceGetUuid points to ???
}

In this example, assume the user is compiling with CUDA 11.3. The user would debug, test, and deploy this application with the known behavior of getting cuDeviceGetUuid (not the _v2 version). Since CUDA guarantees ABI compatibility between minor versions, this same application is expected to run after the driver is upgraded to CUDA 11.4 (without updating the toolkit and runtime) and without recompilation. This will have undefined behavior, though: the typedef for PFN_cuDeviceGetUuid is still the signature of the original version, but since cudaVersion would now be 11040 (CUDA 11.4), cuGetProcAddress would return the function pointer to the _v2 version, and calling it through the old typedef might have undefined behavior.

Note in this case the original (not the _v2 version) typedef looks like:

typedef CUresult (CUDAAPI *PFN_cuDeviceGetUuid_v9020)(CUuuid *uuid, CUdevice_v1 dev);

But the _v2 version typedef looks like:

typedef CUresult (CUDAAPI *PFN_cuDeviceGetUuid_v11040)(CUuuid *uuid, CUdevice_v1 dev);

So in this case, the API/ABI happens to be the same and the runtime API call will likely not cause issues; the only risk is an unexpected uuid being returned. In Implications to API/ABI, we discuss a more problematic case of API/ABI compatibility.

17.5.4.3. API Version Bumps with Explicit Version Checks

The above was a concrete example. Now let’s use a theoretical example that still has compatibility issues across driver versions. Example:

CUresult cuFoo(int bar); // Introduced in CUDA 11.4
CUresult cuFoo_v2(int bar); // Introduced in CUDA 11.5
CUresult cuFoo_v3(int bar, void* jazz); // Introduced in CUDA 11.6

typedef CUresult (CUDAAPI *PFN_cuFoo_v11040)(int bar);
typedef CUresult (CUDAAPI *PFN_cuFoo_v11050)(int bar);
typedef CUresult (CUDAAPI *PFN_cuFoo_v11060)(int bar, void* jazz);

Notice that the API has been modified twice since its original introduction in CUDA 11.4, and the latest revision in CUDA 11.6 also changed the function's API/ABI. Usage in user code compiled against CUDA 11.5 would be:

#include <cuda.h>
#include <cudaTypedefs.h>

CUresult status;
int cudaVersion;
CUdriverProcAddressQueryResult driverStatus;

status = cuDriverGetVersion(&cudaVersion);
// handle status

PFN_cuFoo_v11040 pfn_cuFoo_v11040;
PFN_cuFoo_v11050 pfn_cuFoo_v11050;
if(cudaVersion < 11050) {
    // We know to get the CUDA 11.4 version
    status = cuGetProcAddress("cuFoo", &pfn_cuFoo_v11040, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
    // Handle status and validating pfn_cuFoo_v11040
}
else {
    // Assume >= CUDA 11.5 version we can use the second version
    status = cuGetProcAddress("cuFoo", &pfn_cuFoo_v11050, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
    // Handle status and validating pfn_cuFoo_v11050
}

In this example, without updating to the new typedef in CUDA 11.6 and recompiling the application with that new typedef and case handling, the application will get the cuFoo_v3 function pointer returned, and any use of that function would then cause undefined behavior. The point of this example is to illustrate that even explicit version checks for cuGetProcAddress may not safely cover the minor version bumps within a CUDA major release.

17.5.4.4. Issues with Runtime API Usage

The above examples focused on issues with using the driver API to obtain function pointers to driver APIs. Now we will discuss potential issues with using the runtime API cudaGetDriverEntryPoint.

We will start by using the Runtime APIs similar to the above.

#include <cuda.h>
#include <cudaTypedefs.h>
#include <cuda_runtime.h>

CUresult status;
cudaError_t error;
int driverVersion, runtimeVersion;
enum cudaDriverEntryPointQueryResult driverStatus;

// Ask the runtime for the function
PFN_cuDeviceGetUuid pfn_cuDeviceGetUuidRuntime;
error = cudaGetDriverEntryPoint("cuDeviceGetUuid", &pfn_cuDeviceGetUuidRuntime, cudaEnableDefault, &driverStatus);
if(cudaSuccess == error && pfn_cuDeviceGetUuidRuntime) {
    // pfn_cuDeviceGetUuidRuntime points to ???
}

The function pointer in this example is even more complicated than in the driver-only examples above because there is no control over which version of the function is obtained; it will always be the API for the current CUDA Runtime version. See the following table for more information:

                            Static Runtime Version Linkage
Driver Version Installed    V11.3    V11.4
V11.3                       v1       v1x
V11.4                       v1       v2

V11.3 => 11.3 CUDA Runtime and Toolkit (includes header files cuda.h and cudaTypedefs.h)
V11.4 => 11.4 CUDA Runtime and Toolkit (includes header files cuda.h and cudaTypedefs.h)
v1 => cuDeviceGetUuid
v2 => cuDeviceGetUuid_v2

x => Implies the typedef function pointer won't match the returned
     function pointer.  In these cases, the typedef at compile time
     using a CUDA 11.4 runtime, would match the _v2 version, but the
     returned function pointer would be the original (non _v2) function.

The problem in the table arises with the combination of a newer CUDA 11.4 Runtime and Toolkit and an older driver (CUDA 11.3), labeled as v1x above. In this combination the driver returns the pointer to the older function (non _v2), but the typedef used in the application is for the new function pointer.

17.5.4.5. Issues with Runtime API and Dynamic Versioning

More complications arise when we consider different combinations of the CUDA version with which an application is compiled, CUDA runtime version, and CUDA driver version that an application dynamically links against.

#include <cuda.h>
#include <cudaTypedefs.h>
#include <cuda_runtime.h>

CUresult status;
cudaError_t error;
int driverVersion, runtimeVersion;
CUdriverProcAddressQueryResult driverStatus;
enum cudaDriverEntryPointQueryResult runtimeStatus;

PFN_cuDeviceGetUuid pfn_cuDeviceGetUuidDriver;
status = cuGetProcAddress("cuDeviceGetUuid", &pfn_cuDeviceGetUuidDriver, CUDA_VERSION, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if(CUDA_SUCCESS == status && pfn_cuDeviceGetUuidDriver) {
    // pfn_cuDeviceGetUuidDriver points to ???
}

// Ask the runtime for the function
PFN_cuDeviceGetUuid pfn_cuDeviceGetUuidRuntime;
error = cudaGetDriverEntryPoint("cuDeviceGetUuid", &pfn_cuDeviceGetUuidRuntime, cudaEnableDefault, &runtimeStatus);
if(cudaSuccess == error && pfn_cuDeviceGetUuidRuntime) {
    // pfn_cuDeviceGetUuidRuntime points to ???
}

// Ask the driver for the function based on the driver version (obtained via runtime)
error = cudaDriverGetVersion(&driverVersion);
PFN_cuDeviceGetUuid pfn_cuDeviceGetUuidDriverDriverVer;
status = cuGetProcAddress("cuDeviceGetUuid", &pfn_cuDeviceGetUuidDriverDriverVer, driverVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if(CUDA_SUCCESS == status && pfn_cuDeviceGetUuidDriverDriverVer) {
    // pfn_cuDeviceGetUuidDriverDriverVer points to ???
}

The following matrix of function pointers is expected:

Application Compiled / Runtime Dynamically Linked Version / Driver Version
(3 => CUDA 11.3 and 4 => CUDA 11.4)

Function Pointer                     3/3/3  3/3/4  3/4/3  3/4/4  4/3/3  4/3/4  4/4/3  4/4/4
pfn_cuDeviceGetUuidDriver            t1/v1  t1/v1  t1/v1  t1/v1  N/A    N/A    t2/v1  t2/v2
pfn_cuDeviceGetUuidRuntime           t1/v1  t1/v1  t1/v1  t1/v2  N/A    N/A    t2/v1  t2/v2
pfn_cuDeviceGetUuidDriverDriverVer   t1/v1  t1/v2  t1/v1  t1/v2  N/A    N/A    t2/v1  t2/v2

tX -> Typedef version used at compile time
vX -> Version returned/used at runtime

If the application is compiled against CUDA Version 11.3, it would have the typedef for the original function, but if compiled against CUDA Version 11.4, it would have the typedef for the _v2 function. Because of that, notice the number of cases where the typedef does not match the actual version returned/used.

17.5.4.6. Issues with Runtime API allowing CUDA Version

Unless specified otherwise, the CUDA runtime API cudaGetDriverEntryPointByVersion will have similar implications as the driver entry point cuGetProcAddress since it allows for the user to request a specific CUDA driver version.

17.5.4.7. Implications to API/ABI

In the above examples using cuDeviceGetUuid, the implications of the mismatched API are minimal and may not even be noticeable to many users, as the _v2 version was added to support Multi-Instance GPU (MIG) mode. So, on a system without MIG, the user might not even realize they are getting a different API.

More problematic is an API that changes its function signature (and hence its ABI), such as cuCtxCreate. The _v2 version, introduced in CUDA 3.2, is currently used as the default cuCtxCreate when using cuda.h, but a newer version was introduced in CUDA 11.4 (cuCtxCreate_v3). The API signature has been modified as well and now takes extra arguments. So, in some of the cases above where the typedef for the function pointer doesn't match the returned function pointer, there is a chance of non-obvious ABI incompatibility that would lead to undefined behavior.

For example, assume the following code compiled against a CUDA 11.3 toolkit with a CUDA 11.4 driver installed:

PFN_cuCtxCreate cuUnknown;
CUdriverProcAddressQueryResult driverStatus;

status = cuGetProcAddress("cuCtxCreate", (void**)&cuUnknown, cudaVersion, CU_GET_PROC_ADDRESS_DEFAULT, &driverStatus);
if(CUDA_SUCCESS == status && cuUnknown) {
    status = cuUnknown(&ctx, 0, dev);
}

Running this code with cudaVersion set to anything >= 11040 (indicating CUDA 11.4) could have undefined behavior, because not all of the parameters required by cuCtxCreate_v3 have been supplied.

17.5.5. Determining cuGetProcAddress Failure Reasons

There are two types of errors with cuGetProcAddress: (1) API/usage errors and (2) inability to find the requested driver API. The first error type is reported through the CUresult return value of the API itself, for example when passing NULL as the pfn variable or passing invalid flags.

The second error type is encoded in CUdriverProcAddressQueryResult *symbolStatus and can be used to help distinguish potential issues with the driver being unable to find the requested symbol. Take the following example:

// cuDeviceGetExecAffinitySupport was introduced in release CUDA 11.4
#include <cuda.h>
CUdriverProcAddressQueryResult driverStatus;
cudaVersion = ...;
status = cuGetProcAddress("cuDeviceGetExecAffinitySupport", &pfn, cudaVersion, 0, &driverStatus);
if (CUDA_SUCCESS == status) {
    if (CU_GET_PROC_ADDRESS_VERSION_NOT_SUFFICIENT == driverStatus) {
        printf("We can use the new feature when you upgrade cudaVersion to 11.4, but CUDA driver is good to go!\n");
        // Indicating cudaVersion was < 11.4 but run against a CUDA driver >= 11.4
    }
    else if (CU_GET_PROC_ADDRESS_SYMBOL_NOT_FOUND == driverStatus) {
        printf("Please update both CUDA driver and cudaVersion to at least 11.4 to use the new feature!\n");
        // Indicating driver is < 11.4 since string not found, doesn't matter what cudaVersion was
    }
    else if (CU_GET_PROC_ADDRESS_SUCCESS == driverStatus && pfn) {
        printf("You're using cudaVersion and CUDA driver >= 11.4, using new feature!\n");
        pfn();
    }
}

The first case, with the return code CU_GET_PROC_ADDRESS_VERSION_NOT_SUFFICIENT, indicates that the symbol was found when searching the CUDA driver but was added later than the cudaVersion supplied. In the example, specifying cudaVersion as 11030 or less while running against a CUDA driver >= CUDA 11.4 would give this result, because cuDeviceGetExecAffinitySupport was added in CUDA 11.4 (11040).

The second case, with the return code CU_GET_PROC_ADDRESS_SYMBOL_NOT_FOUND, indicates that the symbol was not found when searching the CUDA driver. This can happen for a few reasons, such as the function not being supported by an older driver, or simply a typo in the symbol name. For the latter, similar to the last example, if the user had passed the symbol as CUDeviceGetExecAffinitySupport (note the capital CU at the start of the string), cuGetProcAddress would not find the API because the string does not match. For the former, an example would be a user developing an application against a CUDA driver that supports the new API and then deploying it against an older CUDA driver. Using the last example, a developer building against CUDA 11.4 or later may have had a successful cuGetProcAddress call during development, but when the deployed application runs against a CUDA 11.3 driver, the call no longer works and CU_GET_PROC_ADDRESS_SYMBOL_NOT_FOUND is returned in driverStatus.

18. CUDA Environment Variables

The following table lists the CUDA environment variables. Environment variables related to the Multi-Process Service are documented in the Multi-Process Service section of the GPU Deployment and Management guide.

Table 24 CUDA Environment Variables

Variable

Values

Description

Device Enumeration and Properties

CUDA_VISIBLE_DEVICES

A comma-separated sequence of GPU identifiers. MIG support: MIG-<GPU-UUID>/<GPU instance ID>/<compute instance ID>

GPU identifiers are given as integer indices or as UUID strings. GPU UUID strings should follow the same format as given by nvidia-smi, such as GPU-8932f937-d72c-4106-c12f-20bd9faed9f6. However, for convenience, abbreviated forms are allowed; simply specify enough digits from the beginning of the GPU UUID to uniquely identify that GPU in the target system. For example, CUDA_VISIBLE_DEVICES=GPU-8932f937 may be a valid way to refer to the above GPU UUID, assuming no other GPU in the system shares this prefix. Only the devices whose index is present in the sequence are visible to CUDA applications and they are enumerated in the order of the sequence. If one of the indices is invalid, only the devices whose index precedes the invalid index are visible to CUDA applications. For example, setting CUDA_VISIBLE_DEVICES to 2,1 causes device 0 to be invisible and device 2 to be enumerated before device 1. Setting CUDA_VISIBLE_DEVICES to 0,2,-1,1 causes devices 0 and 2 to be visible and device 1 to be invisible. MIG format starts with MIG keyword and GPU UUID should follow the same format as given by nvidia-smi. For example, MIG-GPU-8932f937-d72c-4106-c12f-20bd9faed9f6/1/2. Only single MIG instance enumeration is supported.

CUDA_MANAGED_FORCE_DEVICE_ALLOC

0 or 1 (default is 0)

Forces the driver to place all managed allocations in device memory.

CUDA_DEVICE_ORDER

FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST)

FASTEST_FIRST causes CUDA to enumerate the available devices in fastest to slowest order using a simple heuristic. PCI_BUS_ID orders devices by PCI bus ID in ascending order.

Compilation

CUDA_CACHE_DISABLE

0 or 1 (default is 0)

Disables caching (when set to 1) or enables caching (when set to 0) for just-in-time compilation. When disabled, no binary code is added to or retrieved from the cache.

CUDA_CACHE_PATH

filepath

Specifies the folder where the just-in-time compiler caches binary codes; the default values are:

  • on Windows, %APPDATA%\NVIDIA\ComputeCache

  • on Linux, ~/.nv/ComputeCache

CUDA_CACHE_MAXSIZE

integer (default is 1073741824 (1 GiB) for desktop/server platforms and 268435456 (256 MiB) for embedded platforms; the maximum is 4294967296 (4 GiB))

Specifies the size in bytes of the cache used by the just-in-time compiler. Binary codes whose size exceeds the cache size are not cached. Older binary codes are evicted from the cache to make room for newer binary codes if needed.

CUDA_FORCE_PTX_JIT

0 or 1 (default is 0)

When set to 1, forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. This environment variable can be used to validate that PTX code is embedded in an application and that its just-in-time compilation works as expected to guarantee application forward compatibility with future architectures (see Just-in-Time Compilation).

CUDA_DISABLE_PTX_JIT

0 or 1 (default is 0)

When set to 1, disables the just-in-time compilation of embedded PTX code and uses the compatible binary code embedded in an application (see Application Compatibility). If a kernel does not have embedded binary code or the embedded binary was compiled for an incompatible architecture, then it will fail to load. This environment variable can be used to validate that an application has the compatible SASS code generated for each kernel (see Binary Compatibility).

CUDA_FORCE_JIT

0 or 1 (default is 0)

When set to 1, forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. This environment variable can be used to validate that PTX code is embedded in an application and that its just-in-time compilation works as expected to guarantee application forward compatibility with future architectures (see Just-in-Time Compilation). The behavior can be overridden for embedded PTX by setting CUDA_FORCE_PTX_JIT=0.

CUDA_DISABLE_JIT

0 or 1 (default is 0)

When set to 1, disables the just-in-time compilation of embedded PTX code and uses the compatible binary code embedded in an application (see Application Compatibility). If a kernel does not have embedded binary code or the embedded binary was compiled for an incompatible architecture, then it will fail to load. This environment variable can be used to validate that an application has the compatible SASS code generated for each kernel (see Binary Compatibility). The behavior can be overridden for embedded PTX by setting CUDA_DISABLE_PTX_JIT=0.

Execution

CUDA_LAUNCH_BLOCKING

0 or 1 (default is 0)

Disables (when set to 1) or enables (when set to 0) asynchronous kernel launches.

CUDA_DEVICE_MAX_CONNECTIONS

1 to 32 (default is 8)

Sets the number of compute and copy engine concurrent connections (work queues) from the host to each device of compute capability 3.5 and above.

CUDA_AUTO_BOOST

0 or 1

Overrides the autoboost behavior set by the --auto-boost-default option of nvidia-smi. If an application requests via this environment variable a behavior that is different from nvidia-smi's, its request is honored if there is no other application currently running on the same GPU that successfully requested a different behavior; otherwise it is ignored.

CUDA_SCALE_LAUNCH_QUEUES

“0.25x”, “0.5x”, “2x” or “4x”

Scales the size of the queues available for launching work by a fixed multiplier.

cuda-gdb (on Linux platform)

CUDA_DEVICE_WAITS_ON_EXCEPTION

0 or 1 (default is 0)

When set to 1, a CUDA application will halt when a device exception occurs, allowing a debugger to be attached for further debugging.

MPS service (on Linux platform)

CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT

Percentage value (between 0 - 100, default is 0)

Devices of compute capability 8.x allow a portion of the L2 cache to be set aside for persisting data accesses to global memory. When using the CUDA MPS service, the set-aside size can only be controlled using this environment variable, before starting the CUDA MPS control daemon. That is, the environment variable should be set before running the command nvidia-cuda-mps-control -d.

Module loading

CUDA_MODULE_LOADING

DEFAULT, LAZY, EAGER (default is LAZY)

Specifies the module loading mode for the application. When set to EAGER, all kernels and data from a cubin, fatbin or PTX file are fully loaded upon the corresponding cuModuleLoad* and cuLibraryLoad* API call. When set to LAZY, loading of a specific kernel is delayed until a CUfunction handle is extracted with the cuModuleGetFunction or cuKernelGetFunction API calls, and data from the cubin is loaded when the first kernel in the cubin is loaded or when a variable in the cubin is first accessed. Default behavior may change in future CUDA releases.
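As a hedged sketch of where lazy loading materializes work, the driver-API sequence below loads a module and then extracts a kernel handle; under CUDA_MODULE_LOADING=LAZY, most of the loading is deferred to the cuModuleGetFunction call. The module file name kernels.cubin and the kernel name k are hypothetical:

```cuda
#include <cuda.h>  // CUDA driver API

int main() {
  cuInit(0);
  CUdevice dev; cuDeviceGet(&dev, 0);
  CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

  // With CUDA_MODULE_LOADING=LAZY, this mostly sets up bookkeeping.
  CUmodule mod;
  cuModuleLoad(&mod, "kernels.cubin");  // hypothetical module file

  // The kernel (hypothetical name "k") is actually loaded here,
  // at the point its CUfunction handle is extracted.
  CUfunction fn;
  cuModuleGetFunction(&fn, mod, "k");

  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  return 0;
}
```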

CUDA_MODULE_DATA_LOADING

DEFAULT, LAZY, EAGER (default is LAZY)

Specifies the data loading mode for the application. When set to EAGER, all data from a cubin, fatbin or PTX file is fully loaded to memory upon the corresponding cuLibraryLoad* call. This does not affect the LAZY or EAGER loading of kernels. When set to LAZY, loading of data is delayed to the point at which a handle is required. Default behavior may change in future CUDA releases. Data loading behavior is inherited from CUDA_MODULE_LOADING if this environment variable is not set.

Pre-loading dependent libraries

CUDA_FORCE_PRELOAD_LIBRARIES

0 or 1 (default is 0)

When set to 1, forces the driver to preload the libraries required for NVVM and PTX just-in-time compilation during driver initialization. This will increase the memory footprint and the time taken for CUDA driver initialization. This environment variable needs to be set to avoid certain deadlock situations involving multiple CUDA threads.

CUDA Graphs

CUDA_GRAPHS_USE_NODE_PRIORITY

0 or 1

Overrides the cudaGraphInstantiateFlagUseNodePriority flag on graph instantiation. When set to 1, the flag will be set for all graphs and when set to 0, the flag will be cleared for all graphs.

19. Unified Memory Programming

Note

This chapter applies to devices with compute capability 5.0 or higher unless stated otherwise. For devices with compute capability lower than 5.0, refer to the CUDA toolkit documentation for CUDA 11.8.

This documentation on Unified Memory is divided into 3 parts:

19.1. Unified Memory Introduction

CUDA Unified Memory provides all processors with:

  • a single unified memory pool, that is, a single pointer value enables all processors in the system (all CPUs, all GPUs, etc.) to access this memory with all of their native memory operations (pointer dereferences, atomics, etc.).

  • concurrent access to the unified memory pool from all processors in the system.

Unified Memory improves GPU programming in several ways:

  • Productivity: GPU programs may access Unified Memory from GPU and CPU threads concurrently without needing to create separate allocations (cudaMalloc()) and copy memory manually back and forth (cudaMemcpy*()).

  • Performance:

    • Data access speed may be maximized by migrating data towards processors that access it most frequently. Applications can trigger manual migration of data and may use hints to control migration heuristics.

    • Total system memory usage may be reduced by avoiding duplicating memory on both CPUs and GPUs.

  • Functionality: it enables GPU programs to work on data that exceeds the GPU memory’s capacity.

With CUDA Unified Memory, data movement still takes place, and hints may improve performance. These hints are not required for correctness or functionality; that is, programmers may focus on parallelizing their applications across GPUs and CPUs first, and worry about data movement later in the development cycle as a performance optimization. Note that the physical location of data is invisible to a program and may be changed at any time, but accesses to the data's virtual address will remain valid and coherent from any processor regardless of locality.

There are two main ways to obtain CUDA Unified Memory:

  • System-Allocated Memory: memory allocated on the host with system APIs: stack variables, global-/file-scope variables, malloc() / mmap() (see System Allocator for in-depth examples), thread locals, etc.

  • CUDA APIs that explicitly allocate Unified Memory: memory allocated with, for example, cudaMallocManaged() is available on more systems and may perform better than System-Allocated Memory.

19.1.1. System Requirements for Unified Memory

The following table shows the different levels of support for CUDA Unified Memory, the device properties required to detect these levels of support and links to the documentation specific to each level of support:

Table 25 Overview of levels of unified memory support

Unified Memory Support Level

System device properties

Further documentation

Full CUDA Unified Memory: all memory has full support. This includes System-Allocated and CUDA Managed Memory.

Set to 1: pageableMemoryAccess
Systems with hardware acceleration also have the following properties set to 1:
hostNativeAtomicSupported, pageableMemoryAccessUsesHostPageTables, directManagedMemAccessFromHost

Unified Memory on devices with full CUDA Unified Memory support

Only CUDA Managed Memory has full support.

Set to 1: concurrentManagedAccess
Set to 0: pageableMemoryAccess

Unified Memory on devices with only CUDA Managed Memory support

CUDA Managed Memory without full support: unified addressing but no concurrent access.

Set to 1: managedMemory
Set to 0: concurrentManagedAccess

No Unified Memory support.

Set to 0: managedMemory

CUDA for Tegra Memory Management

The behavior of an application that attempts to use Unified Memory on a system that does not support it is undefined. The following properties enable CUDA applications to check the level of system support for Unified Memory, and to be portable between systems with different levels of support:

  • pageableMemoryAccess: This property is set to 1 on systems with CUDA Unified Memory support where all threads may access System-Allocated Memory and CUDA Managed Memory. These systems include NVIDIA Grace Hopper, IBM Power9 + Volta, and modern Linux systems with HMM enabled (see next bullet), among others.

    • Linux HMM requires Linux kernel version 6.1.24+, 6.2.11+ or 6.3+, devices with compute capability 7.5 or higher and a CUDA driver version 535+ installed with Open Kernel Modules.

  • concurrentManagedAccess: This property is set to 1 on systems with full CUDA Managed Memory support. When this property is set to 0, there is only partial support for Unified Memory in CUDA Managed Memory. For Tegra support of Unified Memory, see CUDA for Tegra Memory Management.

A program may query the level of GPU support for CUDA Unified Memory by querying the attributes in the table Overview of levels of unified memory support above using cudaGetDeviceProperties().

19.1.2. Programming Model

With CUDA Unified Memory, separate allocations between host and device, and explicit memory transfers between them, are no longer required. Programs may allocate Unified Memory in the following ways:

  • System-Allocation APIs: on systems with full CUDA Unified Memory support via any system allocation of the host process (C’s malloc(), C++’s new operator, POSIX’s mmap and so on).

  • CUDA Managed Memory Allocation APIs: via the cudaMallocManaged() API which is syntactically similar to cudaMalloc().

  • CUDA Managed Variables: variables declared with __managed__, which are semantically similar to a __device__ variable.

Most examples in this chapter provide at least two versions: one using CUDA Managed Memory and one using System-Allocated Memory. The following samples illustrate how Unified Memory simplifies CUDA programs:

__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

int main() {
  int* d_ptr = nullptr;
  // Does not require any unified memory support
  cudaMalloc(&d_ptr, sizeof(int));
  write_value<<<1, 1>>>(d_ptr, 1);
  int host;
  // Copy memory back to the host and synchronize
  cudaMemcpy(&host, d_ptr, sizeof(int),
             cudaMemcpyDefault);
  printf("value = %d\n", host); 
  cudaFree(d_ptr); 
  return 0;
}
__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

int main() {
  // Requires System-Allocated Memory support
  int* ptr = (int*)malloc(sizeof(int));
  write_value<<<1, 1>>>(ptr, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", *ptr); 
  free(ptr); 
  return 0;
}
__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

int main() {
  // Requires System-Allocated Memory support
  int value;
  write_value<<<1, 1>>>(&value, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", value);
  return 0;
}
__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

int main() {
  int* ptr = nullptr;
  // Requires CUDA Managed Memory support
  cudaMallocManaged(&ptr, sizeof(int));
  write_value<<<1, 1>>>(ptr, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", *ptr); 
  cudaFree(ptr); 
  return 0;
}
__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

// Requires CUDA Managed Memory support
__managed__ int value;

int main() {
  write_value<<<1, 1>>>(&value, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", value);
  return 0;
}

These examples launch a kernel that writes a value on the GPU, which is then read back on the host:

  • Without Unified Memory: both host- and device-side storage for the value is required (host and d_ptr in the example), as is an explicit copy between the two using cudaMemcpy().

  • With Unified Memory: the GPU accesses data directly from the host. The same pointer may be used from host and device code without a separate host-side allocation, and no copy routine is required, greatly simplifying and reducing the size of the program. With:

    • System Allocated: no other changes required.

    • Managed Memory: data allocation changed to use cudaMallocManaged(), which returns a pointer valid from both host and device code.

19.1.2.1. Allocation APIs for System-Allocated Memory

On systems with full CUDA Unified Memory support, all memory is unified memory. This includes memory allocated with system allocation APIs, such as malloc(), mmap(), and the C++ new operator, and also automatic variables on CPU thread stacks, thread locals, global variables, and so on.

System-Allocated Memory may be populated on first touch, depending on the API and system settings used. First touch means that:

  • the allocation APIs allocate virtual memory and return immediately, and

  • physical memory is populated when a thread accesses the memory for the first time.

Usually, the physical memory is chosen "close" to the processor the thread is running on. For example:

  • a GPU thread accesses it first: physical memory of the GPU that thread runs on is chosen.

  • a CPU thread accesses it first: physical CPU memory in the memory NUMA node of the CPU core that thread runs on is chosen.

The CUDA Unified Memory Hint and Prefetch APIs, cudaMemAdvise and cudaMemPrefetchAsync, may be used on System-Allocated Memory. These APIs are covered below in the Data Usage Hints section.

__global__ void printme(char *str) {
  printf(str);
}

int main() {
  // Allocate 100 bytes of memory, accessible to both Host and Device code
  char *s = (char*)malloc(100);
  // Physical allocation placed in CPU memory because host accesses "s" first
  strncpy(s, "Hello Unified Memory\n", 99);
  // Here we pass "s" to a kernel without explicitly copying
  printme<<< 1, 1 >>>(s);
  cudaDeviceSynchronize();
  // Free as for normal system allocations
  free(s);
  return  0;
}

19.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()

On systems with CUDA Managed Memory support, unified memory may be allocated using:

__host__ cudaError_t cudaMallocManaged(void **devPtr, size_t size);

This API is syntactically identical to cudaMalloc(): it allocates size bytes of managed memory and sets devPtr to refer to the allocation. CUDA Managed Memory is also deallocated with cudaFree().

On systems with full CUDA Managed Memory support, managed memory allocations may be accessed concurrently by all CPUs and GPUs in the system. Replacing host calls to cudaMalloc() with cudaMallocManaged() does not impact program semantics on these systems; device code is not able to call cudaMallocManaged().

The following example shows the use of cudaMallocManaged():

__global__ void printme(char *str) {
  printf(str);
}

int main() {
  // Allocate 100 bytes of memory, accessible to both Host and Device code
  char *s;
  cudaMallocManaged(&s, 100);
  // Note direct Host-code use of "s"
  strncpy(s, "Hello Unified Memory\n", 99);
  // Here we pass "s" to a kernel without explicitly copying
  printme<<< 1, 1 >>>(s);
  cudaDeviceSynchronize();
  // Free as for normal CUDA allocations
  cudaFree(s); 
  return  0;
}

Note 注意

For systems that support CUDA Managed Memory allocations, but do not provide full support, see Coherency and Concurrency. Implementation details (may change any time):

  • Devices of compute capability 5.x allocate CUDA Managed Memory on the GPU.

  • Devices of compute capability 6.x and greater populate the memory on first touch, just like System-Allocated Memory APIs.

19.1.2.3. Global-Scope Managed Variables Using __managed__

CUDA __managed__ variables behave as if they were allocated via cudaMallocManaged() (see Explicit Allocation Using cudaMallocManaged() ). They simplify programs with global variables, making it particularly easy to exchange data between host and device without manual allocations or copying.

On systems with full CUDA Unified Memory support, file-scope or global-scope variables cannot be directly accessed by device code. But a pointer to these variables may be passed to the kernel as an argument, see System Allocator for examples.

__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

int main() {
  // Requires System-Allocated Memory support
  int value;
  write_value<<<1, 1>>>(&value, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", value);
  return 0;
}
__global__ void write_value(int* ptr, int v) {
  *ptr = v;
}

// Requires CUDA Managed Memory support
__managed__ int value;

int main() {
  write_value<<<1, 1>>>(&value, 1);
  // Synchronize required
  // (before, cudaMemcpy was synchronizing)
  cudaDeviceSynchronize();
  printf("value = %d\n", value);
  return 0;
}

Note the absence of explicit cudaMemcpy() commands and the fact that the variable value is visible to both CPU and GPU.

A CUDA __managed__ variable implies __device__ and is equivalent to __managed__ __device__, which is also allowed. Variables marked __constant__ may not also be marked __managed__.

A valid CUDA context is necessary for the correct operation of __managed__ variables. Accessing __managed__ variables can trigger CUDA context creation if a context for the current device has not already been created. In the example above, accessing value from host code before the kernel launch would trigger context creation on device 0; in the absence of such an access, the kernel launch itself triggers context creation.

C++ objects declared as __managed__ are subject to certain specific constraints, particularly where static initializers are concerned. Please refer to C++ Language Support for a list of these constraints.

Note

For devices with CUDA Managed Memory without full support, visibility of __managed__ variables for asynchronous operations executing in CUDA streams is discussed in the section on Managing Data Visibility and Concurrent CPU + GPU Access with Streams.

19.1.2.4. Difference between Unified Memory and Mapped Memory

The main difference between Unified Memory and CUDA Mapped Memory is that CUDA Mapped Memory does not guarantee that all kinds of memory accesses (for example, atomics) are supported on all systems, while Unified Memory does. However, the limited set of memory operations that CUDA Mapped Memory does guarantee to support portably is available on more systems than Unified Memory.
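For contrast, here is a minimal sketch of CUDA Mapped Memory: a page-locked host allocation is mapped into the device address space and accessed by the GPU over the interconnect. Only the basic access pattern shown here is portably guaranteed; richer operations such as system-wide atomics are where Unified Memory's guarantees go further:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* p) { ++(*p); }

int main() {
  int* h = nullptr;
  // Page-locked host memory, mapped into the device address space
  cudaHostAlloc((void**)&h, sizeof(int), cudaHostAllocMapped);
  *h = 41;
  int* d = nullptr;
  cudaHostGetDevicePointer((void**)&d, h, 0);
  increment<<<1, 1>>>(d);  // GPU accesses host memory directly
  cudaDeviceSynchronize();
  printf("value = %d\n", *h);
  cudaFreeHost(h);
  return 0;
}
```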

19.1.2.5. Pointer Attributes

CUDA programs may check whether a pointer addresses a CUDA Managed Memory allocation by calling cudaPointerGetAttributes() and testing whether the value of the type attribute is cudaMemoryTypeManaged.

This API returns cudaMemoryTypeHost for system-allocated memory that has been registered with cudaHostRegister() and cudaMemoryTypeUnregistered for system-allocated memory that CUDA is unaware of.

Pointer attributes do not state where the memory resides, they state how the memory was allocated or registered.

The following example shows how to detect the type of pointer at runtime:

char const* kind(cudaPointerAttributes a, bool pma, bool cma) {
    switch(a.type) {
    case cudaMemoryTypeHost: return pma?
      "Unified: CUDA Host or Registered Memory" :
      "Not Unified: CUDA Host or Registered Memory";
    case cudaMemoryTypeDevice: return "Not Unified: CUDA Device Memory";
    case cudaMemoryTypeManaged: return cma?
      "Unified: CUDA Managed Memory" : "Not Unified: CUDA Managed Memory";
    case cudaMemoryTypeUnregistered: return pma?
      "Unified: System-Allocated Memory" :
      "Not Unified: System-Allocated Memory";
    default: return "unknown";
    }
}

void check_pointer(int i, void* ptr) {
  cudaPointerAttributes attr;
  cudaPointerGetAttributes(&attr, ptr);
  int pma = 0, cma = 0, device = 0;
  cudaGetDevice(&device);
  cudaDeviceGetAttribute(&pma, cudaDevAttrPageableMemoryAccess, device);
  cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, device);
  printf("Pointer %d: memory is %s\n", i, kind(attr, pma, cma));
}

__managed__ int managed_var = 5;

int main() {
  int* ptr[5];
  ptr[0] = (int*)malloc(sizeof(int));
  cudaMallocManaged(&ptr[1], sizeof(int));
  cudaMallocHost(&ptr[2], sizeof(int));
  cudaMalloc(&ptr[3], sizeof(int));
  ptr[4] = &managed_var;

  for (int i = 0; i < 5; ++i) check_pointer(i, ptr[i]);
  
  cudaFree(ptr[3]);
  cudaFreeHost(ptr[2]);
  cudaFree(ptr[1]);
  free(ptr[0]);
  return 0;
}

19.1.2.6. Runtime detection of Unified Memory Support Level

The following example shows how to detect the Unified Memory support level at runtime:
以下示例显示如何在运行时检测统一内存支持级别:

int main() {
  int d;
  cudaGetDevice(&d);

  int pma = 0;
  cudaDeviceGetAttribute(&pma, cudaDevAttrPageableMemoryAccess, d);
  printf("Full Unified Memory Support: %s\n", pma == 1? "YES" : "NO");
  
  int cma = 0;
  cudaDeviceGetAttribute(&cma, cudaDevAttrConcurrentManagedAccess, d);
  printf("CUDA Managed Memory with full support: %s\n", cma == 1? "YES" : "NO");

  return 0;
}

19.1.2.7. GPU Memory Oversubscription

Unified Memory enables applications to oversubscribe the memory of any individual processor: in other words, they can allocate and share arrays larger than the memory capacity of any individual processor in the system. This enables, among other things, out-of-core processing of datasets that do not fit within a single GPU, without adding significant complexity to the programming model.
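As a minimal sketch of oversubscription (the kernel and function names below are illustrative, and error handling is kept to a minimum): the following code allocates a managed buffer larger than the physical memory of the current GPU and touches every byte from a kernel. The allocation succeeds because pages are populated and migrated on demand.

```cuda
__global__ void touch(char *data, size_t n) {
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  if (i < n) data[i] = 1;  // first touch populates/migrates pages on demand
}

void test_oversubscription() {
  size_t free_bytes, total_bytes;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  // Request 1.5x the physical GPU memory: more than the GPU can hold at once
  size_t n = total_bytes + total_bytes / 2;
  char *data;
  if (cudaMallocManaged(&data, n) != cudaSuccess) return;
  constexpr int TPB = 256;
  touch<<<(unsigned)((n + TPB - 1) / TPB), TPB>>>(data, n);
  cudaDeviceSynchronize();
  cudaFree(data);
}
```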

19.1.2.8. Performance Hints

The following sections describe the available Unified Memory performance hints, which may be used on all Unified Memory: for example, on CUDA Managed Memory or, on systems with full CUDA Unified Memory support, also on all System-Allocated Memory. These APIs are hints; that is, they do not impact the semantics of applications, only their performance. They can therefore be added or removed anywhere in an application without impacting its results.

CUDA Unified Memory may not always have all the information necessary to make the best performance decisions related to unified memory. These performance hints enable the application to provide CUDA with more information.

Note that applications should only use these hints if they improve performance.

19.1.2.8.1. Data Prefetching

The cudaMemPrefetchAsync API is an asynchronous stream-ordered API that may migrate data to reside closer to the specified processor. The data may be accessed while it is being prefetched. The migration does not begin until all prior operations in the stream have completed, and completes before any subsequent operation in the stream.

cudaError_t cudaMemPrefetchAsync(const void *devPtr,
                                 size_t count,
                                 int dstDevice,
                                 cudaStream_t stream);

The memory region [devPtr, devPtr + count) may be migrated to the destination device dstDevice (or to the CPU, if cudaCpuDeviceId is used) when the prefetch task is executed in the given stream.

Consider the simple code example below:

void test_prefetch_sam(cudaStream_t s) {
  char *data = (char*)malloc(N);
  init_data(data, N);                                     // execute on CPU
  cudaMemPrefetchAsync(data, N, myGpuId, s);              // prefetch to GPU
  mykernel<<<(N + TPB - 1) / TPB, TPB, 0, s>>>(data, N);  // execute on GPU
  cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);      // prefetch to CPU
  cudaStreamSynchronize(s);
  use_data(data, N);
  free(data);
}
void test_prefetch_managed(cudaStream_t s) {
  char *data;
  cudaMallocManaged(&data, N);
  init_data(data, N);                                     // execute on CPU
  cudaMemPrefetchAsync(data, N, myGpuId, s);              // prefetch to GPU
  mykernel<<<(N + TPB - 1) / TPB, TPB, 0, s>>>(data, N);  // execute on GPU
  cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);      // prefetch to CPU
  cudaStreamSynchronize(s);
  use_data(data, N);
  cudaFree(data);
}

19.1.2.8.2. Data Usage Hints

When multiple processors simultaneously access the same data, cudaMemAdvise may be used to hint how the data at [devPtr, devPtr + count) will be accessed:

cudaError_t cudaMemAdvise(const void *devPtr,
                          size_t count,
                          enum cudaMemoryAdvise advice,
                          int device);

Where advice may take the following values:

  • cudaMemAdviseSetReadMostly: This implies that the data is mostly going to be read from and only occasionally written to. In general, it allows trading off write bandwidth for read bandwidth on this region. Example:

void test_advise_managed(cudaStream_t stream) {
  char *dataPtr;
  size_t dataSize = 64 * TPB;  // 16 KiB
  // Allocate memory using cudaMallocManaged
  // (malloc may be used on systems with full CUDA Unified memory support)
  cudaMallocManaged(&dataPtr, dataSize);
  // Set the advice on the memory region
  cudaMemAdvise(dataPtr, dataSize, cudaMemAdviseSetReadMostly, myGpuId);
  int outerLoopIter = 0;
  while (outerLoopIter < maxOuterLoopIter) {
    // The data is written to in the outer loop on the CPU
    init_data(dataPtr, dataSize);
    // The data is made available to all GPUs by prefetching.
    // Prefetching here causes read duplication of data instead
    // of data migration
    for (int device = 0; device < maxDevices; device++) {
      cudaMemPrefetchAsync(dataPtr, dataSize, device, stream);
    }
    // The kernel only reads this data in the inner loop
    int innerLoopIter = 0;
    while (innerLoopIter < maxInnerLoopIter) {
      mykernel<<<32, TPB, 0, stream>>>((const char *)dataPtr, dataSize);
      innerLoopIter++;
    }
    outerLoopIter++;
  }
  cudaFree(dataPtr);
}
  • cudaMemAdviseSetPreferredLocation: In general, any memory may be migrated at any time to any location, for example, when a given processor is running out of physical memory. This hint sets the preferred location for the data to the physical memory belonging to device, telling the system that migrating this memory region away from its preferred location is undesired. Passing in a value of cudaCpuDeviceId for device sets the preferred location as CPU memory. Other APIs, like cudaMemPrefetchAsync, may override this hint, causing the memory to be migrated away from its preferred location.

  • cudaMemAdviseSetAccessedBy: On some systems, it may be beneficial for performance to establish a mapping into memory before accessing the data from a given processor. This hint tells the system that the data will be frequently accessed by device, enabling the system to assume that creating these mappings pays off. This hint does not imply where the data should reside, but it can be combined with cudaMemAdviseSetPreferredLocation to specify that.

Each advice can also be unset by using one of the following values: cudaMemAdviseUnsetReadMostly, cudaMemAdviseUnsetPreferredLocation and cudaMemAdviseUnsetAccessedBy.

19.1.2.8.3. Querying Data Usage Attributes on Managed Memory

A program can query memory range attributes assigned through cudaMemAdvise or cudaMemPrefetchAsync on CUDA Managed Memory by using the following API:

cudaMemRangeGetAttribute(void *data,
                         size_t dataSize,
                         enum cudaMemRangeAttribute attribute,
                         const void *devPtr,
                         size_t count);

This function queries an attribute of the memory range starting at devPtr with a size of count bytes. The memory range must refer to managed memory allocated via cudaMallocManaged or declared via __managed__ variables. It is possible to query the following attributes:

  • cudaMemRangeAttributeReadMostly: the result returned will be 1 if the entire memory range has the cudaMemAdviseSetReadMostly attribute set, or 0 otherwise.

  • cudaMemRangeAttributePreferredLocation: the result returned will be a GPU device id or cudaCpuDeviceId if the entire memory range has the corresponding processor as its preferred location; otherwise cudaInvalidDeviceId will be returned. An application can use this query API to make decisions about staging data through the CPU or GPU depending on the preferred location attribute of the managed pointer. Note that the actual location of the memory range at the time of the query may be different from the preferred location.

  • cudaMemRangeAttributeAccessedBy: will return the list of devices that have the cudaMemAdviseSetAccessedBy advice set for that memory range.

  • cudaMemRangeAttributeLastPrefetchLocation: will return the last location to which the memory range was explicitly prefetched using cudaMemPrefetchAsync. Note that this simply returns the last location that the application requested to prefetch the memory range to. It gives no indication as to whether the prefetch operation to that location has completed or even begun.

Additionally, multiple attributes can be queried by using the corresponding cudaMemRangeGetAttributes function.
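As a short illustration (the helper name query_advice is ours), the following sketch queries three of these attributes on a managed memory range:

```cuda
void query_advice(void *devPtr, size_t count) {
  int readMostly = 0, preferred = 0, lastPrefetch = 0;
  // 1 if cudaMemAdviseSetReadMostly is set on the entire range, 0 otherwise
  cudaMemRangeGetAttribute(&readMostly, sizeof(readMostly),
                           cudaMemRangeAttributeReadMostly, devPtr, count);
  // a GPU device id, cudaCpuDeviceId, or cudaInvalidDeviceId
  cudaMemRangeGetAttribute(&preferred, sizeof(preferred),
                           cudaMemRangeAttributePreferredLocation, devPtr, count);
  // last location explicitly prefetched to via cudaMemPrefetchAsync
  cudaMemRangeGetAttribute(&lastPrefetch, sizeof(lastPrefetch),
                           cudaMemRangeAttributeLastPrefetchLocation, devPtr, count);
  printf("read-mostly: %d, preferred: %d, last prefetch: %d\n",
         readMostly, preferred, lastPrefetch);
}
```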

19.2. Unified memory on devices with full CUDA Unified Memory support

19.2.1. System-Allocated Memory: in-depth examples

Systems with full CUDA Unified Memory support allow the device to access any memory owned by the host process interacting with the device. This section shows a few advanced use cases, using a kernel that simply prints the first 8 characters of an input character array to the standard output stream:

__global__ void kernel(const char* type, const char* data) {
  static const int n_char = 8;
  printf("%s - first %d characters: '", type, n_char);
  for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
  printf("'\n");
}

The following examples show various ways in which this kernel may be called:

void test_malloc() {
  const char test_string[] = "Hello World";
  char* heap_data = (char*)malloc(sizeof(test_string));
  strncpy(heap_data, test_string, sizeof(test_string));
  kernel<<<1, 1>>>("malloc", heap_data);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
  free(heap_data);
}
void test_managed() {
  const char test_string[] = "Hello World";
  char* data;
  cudaMallocManaged(&data, sizeof(test_string));
  strncpy(data, test_string, sizeof(test_string));
  kernel<<<1, 1>>>("managed", data);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
  cudaFree(data);
}
void test_stack() {
  const char test_string[] = "Hello World";
  kernel<<<1, 1>>>("stack", test_string);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
void test_static() {
  static const char test_string[] = "Hello World";
  kernel<<<1, 1>>>("static", test_string);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
const char global_string[] = "Hello World";

void test_global() {
  kernel<<<1, 1>>>("global", global_string);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
// declared in separate file, see below
extern char* ext_data;

void test_extern() {
  kernel<<<1, 1>>>("extern", ext_data);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
}
/** This may be a non-CUDA file */
char* ext_data;
static const char global_string[] = "Hello World";

void __attribute__ ((constructor)) setup(void) {
  ext_data = (char*)malloc(sizeof(global_string));
  strncpy(ext_data, global_string, sizeof(global_string));
}

void __attribute__ ((destructor)) tear_down(void) {
  free(ext_data);
}

The first three examples above show the cases already detailed in the Programming Model section. The next three examples show various ways a file-scope or global-scope variable can be accessed from the device.

Note that the extern variable could be declared, and its memory owned and managed, by a third-party library that does not interact with CUDA at all.

Also note that stack variables as well as file-scope and global-scope variables can only be accessed by the GPU through a pointer. In this specific example, this is convenient because the character array is already declared as a pointer: const char*. However, consider the following example with a global-scope integer:

// this variable is declared at global scope
int global_variable;

__global__ void kernel_uncompilable() {
  // this causes a compilation error: global (__host__) variables must not
  // be accessed from __device__ / __global__ code
  printf("%d\n", global_variable);
}

// On systems with pageableMemoryAccess set to 1, we can access the address
// of a global variable. The below kernel takes that address as an argument
__global__ void kernel(int* global_variable_addr) {
  printf("%d\n", *global_variable_addr);
}
int main() {
  kernel<<<1, 1>>>(&global_variable);
  ...
  return 0;
}

In the example above, we need to ensure that we pass a pointer to the global variable to the kernel instead of directly accessing the global variable in the kernel. This is because global variables without the __managed__ specifier are declared as __host__-only by default, so most compilers currently do not allow using these variables directly in device code.

19.2.1.1. File-backed Unified Memory

Since systems with full CUDA Unified Memory support allow the device to access any memory owned by the host process, they can directly access file-backed memory.

Here, we show a modified version of the initial example from the previous section that uses file-backed memory to print, from the GPU, a string read directly from an input file. In the following example, the memory is backed by a physical file, but the example applies to memory-backed files, too, as shown in the section on Inter-Process Communication with Unified Memory.

__global__ void kernel(const char* type, const char* data) {
  static const int n_char = 8;
  printf("%s - first %d characters: '", type, n_char);
  for (int i = 0; i < n_char; ++i) printf("%c", data[i]);
  printf("'\n");
}
void test_file_backed() {
  int fd = open(INPUT_FILE_NAME, O_RDONLY);
  ASSERT(fd >= 0, "Invalid file handle");
  struct stat file_stat;
  int status = fstat(fd, &file_stat);
  ASSERT(status >= 0, "Invalid file stats");
  char* mapped = (char*)mmap(0, file_stat.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  ASSERT(mapped != MAP_FAILED, "Cannot map file into memory");
  kernel<<<1, 1>>>("file-backed", mapped);
  ASSERT(cudaDeviceSynchronize() == cudaSuccess,
    "CUDA failed with '%s'", cudaGetErrorString(cudaGetLastError()));
  ASSERT(munmap(mapped, file_stat.st_size) == 0, "Cannot unmap file");
  ASSERT(close(fd) == 0, "Cannot close file");
}

Note that on systems without the hostNativeAtomicSupported property, including systems with Linux HMM enabled, atomic accesses to file-backed memory are not supported.

19.2.1.2. Inter-Process Communication (IPC) with Unified Memory

Note

As of now, using IPC with Unified Memory can have significant performance implications.

Many applications prefer to manage one GPU per process but still need to use Unified Memory, for example for over-subscription, and to access it from multiple GPUs.

CUDA IPC (see Interprocess Communication) does not support Managed Memory: handles to this type of memory may not be shared through any of the mechanisms discussed in this section. On systems with full CUDA Unified Memory support, System-Allocated Memory is Inter-Process Communication (IPC) capable. Once access to System-Allocated Memory has been shared with other processes, the same Unified Memory Programming Model applies, similar to File-backed Unified Memory.

Under Linux, IPC-capable System-Allocated Memory can be created in several ways, for example with POSIX shared memory, memory-backed files, or memory-mapped files shared between processes; see the respective Linux documentation for details.

Note that it is not possible to share memory between different hosts and their devices using this technique.

19.2.2. Performance Tuning

In order to achieve good performance with Unified Memory, it is important to:

  • Understand how paging works on your system, and how to avoid unnecessary page faults.

  • Understand the various mechanisms that allow keeping data local to the accessing processor.

  • Consider tuning your application for the granularity of memory transfers of your system.

As general advice, Unified Memory Performance Hints might provide improved performance, but using them incorrectly might degrade performance compared to the default behavior. Also note that any hint has a performance cost associated with it on the host, so useful hints must at the very least improve performance enough to overcome this cost.

19.2.2.1. Memory Paging and Page Sizes

Many of the sections on unified memory performance tuning assume prior knowledge of virtual addressing, memory pages and page sizes. This section attempts to define all necessary terms and explain why paging matters for performance.

All currently supported systems for Unified Memory use a virtual address space: this means that memory addresses used by an application represent a virtual location which might be mapped to a physical location where the memory actually resides.

All currently supported processors, including both CPUs and GPUs, additionally use memory paging. Because all systems use a virtual address space, there are two types of memory pages:

  • Virtual pages: this represents a fixed-size contiguous chunk of virtual memory per process tracked by the operating system, which can be mapped into physical memory. Note that the virtual page is linked to the mapping: for example, a single virtual address might be mapped into physical memory using different page sizes.

  • Physical pages: this represents a fixed-size contiguous chunk of memory the processor’s main Memory Management Unit (MMU) supports and into which a virtual page can be mapped.

Currently, all x86_64 CPUs use 4KiB physical pages. Arm CPUs support multiple physical page sizes - 4KiB, 16KiB, 32KiB and 64KiB - depending on the exact CPU. Finally, NVIDIA GPUs support multiple physical page sizes, but prefer 2MiB physical pages or larger. Note that these sizes are subject to change in future hardware.

The default page size of virtual pages usually corresponds to the physical page size, but an application may use different page sizes as long as they are supported by the operating system and the hardware. Typically, supported virtual page sizes must be powers of 2 and multiples of the physical page size.

The logical entity tracking the mapping of virtual pages into physical pages will be referred to as a page table, and each mapping of a given virtual page of a given virtual size to physical pages is called a page table entry (PTE). All supported processors provide specific caches for the page table to speed up the translation of virtual addresses to physical addresses. These caches are called translation lookaside buffers (TLBs).

There are two important aspects for performance tuning of applications:

  • the choice of virtual page size,

  • whether the system offers a combined page table used by both CPUs and GPUs, or separate page tables for each CPU and GPU individually.

19.2.2.1.1. Choosing the right page size

In general, small page sizes lead to less (virtual) memory fragmentation but more TLB misses, whereas larger page sizes lead to more memory fragmentation but fewer TLB misses. Additionally, memory migration is generally more expensive with larger page sizes than with smaller page sizes, because full memory pages are typically migrated. This can cause larger latency spikes in an application using large page sizes. See also the next section for more details on page faults.

One important aspect for performance tuning is that TLB misses are generally significantly more expensive on the GPU than on the CPU. This means that if a GPU thread frequently accesses random locations of Unified Memory mapped using a small page size, it might be significantly slower than the same accesses to Unified Memory mapped using a sufficiently large page size. While a similar effect might occur for a CPU thread randomly accessing a large area of memory mapped using a small page size, the slowdown is less pronounced, meaning that the application might want to trade off this slowdown against having less memory fragmentation.

Note that in general, applications should not tune their performance to the physical page size of a given processor, since physical page sizes are subject to change depending on the hardware. The advice above only applies to virtual page sizes.

19.2.2.1.2. CPU and GPU page tables: hardware coherency vs. software coherency

Note

In the remainder of the performance tuning documentation, we will refer to systems with a combined page table for both CPUs and GPUs as hardware coherent systems. Systems with separate page tables for CPUs and GPUs are referred to as software coherent.

Hardware coherent systems such as NVIDIA Grace Hopper offer a logically combined page table for both CPUs and GPUs. This is important because, in order to access System-Allocated Memory from the GPU, the GPU uses whichever page table entry was created by the CPU for the requested memory. If that page table entry uses the default CPU page size of 4KiB or 64KiB, accesses to large virtual memory areas will cause significant TLB misses and thus significant slowdowns.

See the section on configuring huge pages for examples on how to ensure System-Allocated Memory uses large enough page sizes to avoid this type of issue.

On the other hand, on systems where the CPUs and GPUs each have their own logical page table, different performance tuning aspects should be considered: in order to guarantee coherency, these systems usually use page faults when a processor accesses a memory address mapped into the physical memory of a different processor. Such a page fault means that:

  • it needs to be ensured that the currently owning processor (where the physical page currently resides) cannot access this page anymore, either by deleting the page table entry or updating it.

  • it needs to be ensured that the processor requesting access can access this page, either by creating a new page table entry or updating an existing entry, such that it becomes valid/active.

  • the physical page backing this virtual page must be moved/migrated to the processor requesting access: this can be an expensive operation, and the amount of work is proportional to the page size.

Overall, hardware coherent systems provide significant performance benefits compared to software coherent systems in cases where frequent concurrent accesses to the same memory page are made by both CPU and GPU threads:

  • fewer page faults: these systems do not need to use page faults for emulating coherency or migrating memory,

  • less contention: these systems are coherent at cache-line granularity instead of page-size granularity. When there is contention from multiple processors within a cache line, only the cache line is exchanged, which is much smaller than the smallest page size; and when the different processors access different cache lines within a page, there is no contention at all.

This impacts the performance of the following scenarios:

  • Atomic updates to the same address concurrently from both CPUs and GPUs.

  • Signaling a GPU thread from a CPU thread, or vice versa.

19.2.2.2. Direct Unified Memory Access from host

Some devices have hardware support for coherent reads, stores and atomic accesses from the host on GPU-resident unified memory. These devices have the attribute cudaDevAttrDirectManagedMemAccessFromHost set to 1. Note that all hardware coherent systems have this attribute set for NVLink-connected devices. On these systems, the host has direct access to GPU-resident memory without page faults and data migration (see Data Usage Hints for more details on memory usage hints). Note that with CUDA Managed Memory, the cudaMemAdviseSetAccessedBy hint with cudaCpuDeviceId is necessary to enable this direct access without page faults.

Consider the example code below:

__global__ void write(int *ret, int a, int b) {
  ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
  ret[threadIdx.x] += a + b + threadIdx.x;
}
void test_malloc() {
  int *ret = (int*)malloc(1000 * sizeof(int));
  // for shared page table systems, the following hint is not necessary
  cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);

  write<<< 1, 1000 >>>(ret, 10, 100);            // pages populated in GPU memory
  cudaDeviceSynchronize();
  for(int i = 0; i < 1000; i++)
      printf("%d: A+B = %d\n", i, ret[i]);        // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
                                                  // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
  append<<< 1, 1000 >>>(ret, 10, 100);            // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
  cudaDeviceSynchronize();                        // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
  free(ret);
}
__global__ void write(int *ret, int a, int b) {
  ret[threadIdx.x] = a + b + threadIdx.x;
}

__global__ void append(int *ret, int a, int b) {
  ret[threadIdx.x] += a + b + threadIdx.x;
}

void test_managed() {
  int *ret;
  cudaMallocManaged(&ret, 1000 * sizeof(int));
  cudaMemAdvise(ret, 1000 * sizeof(int), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId);  // set direct access hint

  write<<< 1, 1000 >>>(ret, 10, 100);            // pages populated in GPU memory
  cudaDeviceSynchronize();
  for(int i = 0; i < 1000; i++)
      printf("%d: A+B = %d\n", i, ret[i]);        // directManagedMemAccessFromHost=1: CPU accesses GPU memory directly without migrations
                                                  // directManagedMemAccessFromHost=0: CPU faults and triggers device-to-host migrations
  append<<< 1, 1000 >>>(ret, 10, 100);            // directManagedMemAccessFromHost=1: GPU accesses GPU memory without migrations
  cudaDeviceSynchronize();                        // directManagedMemAccessFromHost=0: GPU faults and triggers host-to-device migrations
  cudaFree(ret); 
}

After the write kernel completes, ret will be created and initialized in GPU memory. Next, the CPU will access ret, followed by the append kernel using the same ret memory again. This code will show different behavior depending on the system architecture and its support for hardware coherency:

  • On systems with directManagedMemAccessFromHost=1: CPU accesses to the managed buffer will not trigger any migrations; the data will remain resident in GPU memory, and any subsequent GPU kernels can continue to access it directly without incurring faults or migrations.

  • On systems with directManagedMemAccessFromHost=0: CPU accesses to the managed buffer will page fault and initiate data migration; the first time any GPU kernel tries to access the same data, it will page fault and migrate the pages back to GPU memory.

19.2.2.3. Host Native Atomics

Some devices, including NVLink-connected devices in hardware coherent systems, support hardware-accelerated atomic accesses to CPU-resident memory. This implies that atomic accesses to host memory do not have to be emulated with a page fault. For these devices, the attribute cudaDevAttrHostNativeAtomicSupported is set to 1.
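
As an illustrative sketch (not part of the guide's example set), this attribute can be queried with cudaDeviceGetAttribute() so that an application can choose between a native-atomic path and a fallback:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Query whether this device supports hardware-accelerated atomic
    // accesses to CPU-resident memory (1 = native, 0 = emulated).
    int hostNativeAtomics = 0;
    cudaDeviceGetAttribute(&hostNativeAtomics,
                           cudaDevAttrHostNativeAtomicSupported, device);

    printf("Host native atomics supported: %d\n", hostNativeAtomics);
    return 0;
}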

19.3. Unified memory on devices without full CUDA Unified Memory support

19.3.1. Unified memory on devices with only CUDA Managed Memory support

For devices with compute capability 6.x or higher but without pageable memory access, CUDA Managed Memory is fully supported and coherent. The programming model and performance tuning of Unified Memory are largely similar to the model described in Unified memory on devices with full CUDA Unified Memory support, with the notable exception that system allocators cannot be used to allocate memory. Thus, the sub-sections of that chapter that rely on system allocators do not apply.

19.3.2. Unified memory on Windows or devices with compute capability 5.x

Devices with compute capability lower than 6.0 or Windows platforms support CUDA Managed Memory v1.0 with limited support for data migration and coherency as well as memory oversubscription. The following sub-sections describe in more detail how to use and optimize Managed Memory on these platforms.

19.3.2.1. Data Migration and Coherency

GPU architectures of compute capability lower than 6.0 do not support fine-grained movement of managed data to the GPU on demand. Whenever a GPU kernel is launched, all managed memory generally has to be transferred to GPU memory to avoid faulting on memory access. With compute capability 6.x, a new GPU page faulting mechanism is introduced that provides more seamless Unified Memory functionality. Combined with the system-wide virtual address space, page faulting provides several benefits. First, page faulting means that the CUDA system software doesn’t need to synchronize all managed memory allocations to the GPU before each kernel launch. If a kernel running on the GPU accesses a page that is not resident in its memory, it faults, allowing the page to be automatically migrated to GPU memory on demand. Alternatively, the page may be mapped into the GPU address space for access over the PCIe or NVLink interconnects (mapping on access can sometimes be faster than migration). Note that Unified Memory is system-wide: GPUs (and CPUs) can fault on and migrate memory pages either from CPU memory or from the memory of other GPUs in the system.

19.3.2.2. GPU Memory Oversubscription

Devices of compute capability lower than 6.0 cannot allocate more managed memory than the physical size of GPU memory.

19.3.2.3. Multi-GPU

On systems with devices of compute capability lower than 6.0, managed allocations are automatically visible to all GPUs in the system via the peer-to-peer capabilities of the GPUs. Managed memory allocations behave similarly to unmanaged memory allocated using cudaMalloc(): the current active device is the home for the physical allocation, but other GPUs in the system will access the memory at reduced bandwidth over the PCIe bus.

On Linux the managed memory is allocated in GPU memory as long as all GPUs that are actively being used by a program have the peer-to-peer support. If at any time the application starts using a GPU that doesn’t have peer-to-peer support with any of the other GPUs that have managed allocations on them, then the driver will migrate all managed allocations to system memory. In this case, all GPUs experience PCIe bandwidth restrictions.

On Windows, if peer mappings are not available (for example, between GPUs of different architectures), then the system will automatically fall back to using zero-copy memory, regardless of whether both GPUs are actually used by a program. If only one GPU is actually going to be used, it is necessary to set the CUDA_VISIBLE_DEVICES environment variable before launching the program. This constrains which GPUs are visible and allows managed memory to be allocated in GPU memory.

Alternatively, on Windows users can also set CUDA_MANAGED_FORCE_DEVICE_ALLOC to a non-zero value to force the driver to always use device memory for physical storage. When this environment variable is set to a non-zero value, all devices used in that process that support managed memory have to be peer-to-peer compatible with each other. The error ::cudaErrorInvalidDevice will be returned if a device that supports managed memory is used and it is not peer-to-peer compatible with any of the other managed memory supporting devices that were previously used in that process, even if ::cudaDeviceReset has been called on those devices. These environment variables are described in CUDA Environment Variables. Note that starting from CUDA 8.0 CUDA_MANAGED_FORCE_DEVICE_ALLOC has no effect on Linux operating systems.

19.3.2.4. Coherency and Concurrency

Simultaneous access to managed memory on devices of compute capability lower than 6.0 is not possible, because coherence could not be guaranteed if the CPU accessed a Unified Memory allocation while a GPU kernel was active.
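
As a minimal sketch (not from the guide), a program can query the concurrentManagedAccess attribute to determine whether concurrent CPU/GPU access to managed memory is legal on the current device:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // 1 on devices of compute capability 6.x or higher that support
    // concurrent access; 0 on pre-6.0 devices and on Windows, where the
    // CPU must not touch managed memory while the GPU is active.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess, device);

    printf("concurrentManagedAccess = %d\n", concurrent);
    return 0;
}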

19.3.2.4.1. GPU Exclusive Access To Managed Memory

To ensure coherency on pre-6.x GPU architectures, the Unified Memory programming model puts constraints on data accesses while both the CPU and GPU are executing concurrently. In effect, the GPU has exclusive access to all managed data while any kernel operation is executing, regardless of whether the specific kernel is actively using the data. When managed data is used with cudaMemcpy*() or cudaMemset*(), the system may choose to access the source or destination from the host or the device, which will put constraints on concurrent CPU access to that data while the cudaMemcpy*() or cudaMemset*() is executing. See Memcpy()/Memset() Behavior With Managed Memory for further details.

For devices with the concurrentManagedAccess property set to 0, the CPU is not permitted to access any managed allocations or variables while the GPU is active. On these systems, concurrent CPU/GPU accesses, even to different managed memory allocations, will cause a segmentation fault because the page is considered inaccessible to the CPU.

__device__ __managed__ int x, y=2;
__global__  void  kernel() {
    x = 10;
}
int main() {
    kernel<<< 1, 1 >>>();
    y = 20;            // Error on GPUs not supporting concurrent access

    cudaDeviceSynchronize();
    return  0;
}

In the example above, the GPU program kernel is still active when the CPU touches y. (Note how the access occurs before cudaDeviceSynchronize().) The code runs successfully on devices of compute capability 6.x thanks to the GPU page faulting capability, which lifts all restrictions on simultaneous access. However, such memory access is invalid on pre-6.x architectures even though the CPU is accessing different data than the GPU. The program must explicitly synchronize with the GPU before accessing y:

__device__ __managed__ int x, y=2;
__global__  void  kernel() {
    x = 10;
}
int main() {
    kernel<<< 1, 1 >>>();
    cudaDeviceSynchronize();
    y = 20;            // Success on GPUs not supporting concurrent access
    return  0;
}

As this example shows, on systems with pre-6.x GPU architectures, a CPU thread may not access any managed data in between performing a kernel launch and a subsequent synchronization call, regardless of whether the GPU kernel actually touches that same data (or any managed data at all). The mere potential for concurrent CPU and GPU access is sufficient for a process-level exception to be raised.

Note that if memory is dynamically allocated with cudaMallocManaged() or cuMemAllocManaged() while the GPU is active, the behavior of the memory is unspecified until additional work is launched or the GPU is synchronized. Attempting to access the memory on the CPU during this time may or may not cause a segmentation fault. This does not apply to memory allocated using the flag cudaMemAttachHost or CU_MEM_ATTACH_HOST.

19.3.2.4.2. Explicit Synchronization and Logical GPU Activity

Note that explicit synchronization is required even if kernel runs quickly and finishes before the CPU touches y in the above example. Unified Memory uses logical activity to determine whether the GPU is idle. This aligns with the CUDA programming model, which specifies that a kernel can run at any time following a launch and is not guaranteed to have finished until the host issues a synchronization call.

Any function call that logically guarantees the GPU completes its work is valid. This includes cudaDeviceSynchronize(); cudaStreamSynchronize() and cudaStreamQuery() (provided it returns cudaSuccess and not cudaErrorNotReady) where the specified stream is the only stream still executing on the GPU; cudaEventSynchronize() and cudaEventQuery() in cases where the specified event is not followed by any device work; as well as uses of cudaMemcpy() and cudaMemset() that are documented as being fully synchronous with respect to the host.
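
As a hedged sketch of this rule: when a single stream holds all outstanding GPU work, synchronizing just that stream is sufficient before the CPU touches managed data:

__device__ __managed__ int x;
__global__ void kernel() {
    x = 10;
}
int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    kernel<<< 1, 1, 0, stream >>>();
    // "stream" is the only stream still executing on the GPU, so this call
    // logically guarantees the GPU is idle; polling cudaStreamQuery(stream)
    // until it returns cudaSuccess would work equally well.
    cudaStreamSynchronize(stream);
    int y = x;              // Safe: the GPU is known to be idle.
    cudaStreamDestroy(stream);
    return y == 10 ? 0 : 1;
}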

Dependencies created between streams are also followed when synchronizing on a stream or event, in order to infer completion of other streams. Dependencies can be created via cudaStreamWaitEvent() or implicitly when using the default (NULL) stream.

It is legal for the CPU to access managed data from within a stream callback, provided no other stream that could potentially be accessing managed data is active on the GPU. In addition, a callback that is not followed by any device work can be used for synchronization: for example, by signaling a condition variable from inside the callback; otherwise, CPU access is valid only for the duration of the callback(s).

There are several important points of note:

  • It is always permitted for the CPU to access non-managed zero-copy data while the GPU is active.

  • The GPU is considered active when it is running any kernel, even if that kernel does not make use of managed data. If a kernel might use data, then access is forbidden, unless device property concurrentManagedAccess is 1.

  • There are no constraints on concurrent inter-GPU access of managed memory, other than those that apply to multi-GPU access of non-managed memory.

  • There are no constraints on concurrent GPU kernels accessing managed data.

Note how the last point allows for races between GPU kernels, as is currently the case for non-managed GPU memory. As mentioned previously, managed memory functions identically to non-managed memory from the perspective of the GPU. The following code example illustrates these points:

int main() {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    int *non_managed, *managed, *also_managed;
    cudaMallocHost(&non_managed, 4);    // Non-managed, CPU-accessible memory
    cudaMallocManaged(&managed, 4);
    cudaMallocManaged(&also_managed, 4);
    // Point 1: CPU can access non-managed data.
    kernel<<< 1, 1, 0, stream1 >>>(managed);
    *non_managed = 1;
    // Point 2: CPU cannot access any managed data while GPU is busy,
    //          unless concurrentManagedAccess = 1
    // Note we have not yet synchronized, so "kernel" is still active.
    *also_managed = 2;      // Will issue segmentation fault
    // Point 3: Concurrent GPU kernels can access the same data.
    kernel<<< 1, 1, 0, stream2 >>>(managed);
    // Point 4: Multi-GPU concurrent access is also permitted.
    cudaSetDevice(1);
    kernel<<< 1, 1 >>>(managed);
    return  0;
}
19.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams

Until now it was assumed that, for SM architectures before 6.x: 1) any active kernel may use any managed memory, and 2) it was invalid to use managed memory from the CPU while a kernel is active. Here we present a system for finer-grained control of managed memory, designed to work on all devices supporting managed memory, including older architectures with concurrentManagedAccess equal to 0.

The CUDA programming model provides streams as a mechanism for programs to indicate dependence and independence among kernel launches. Kernels launched into the same stream are guaranteed to execute consecutively, while kernels launched into different streams are permitted to execute concurrently. Streams describe independence between work items and hence allow potentially greater efficiency through concurrency.

Unified Memory builds upon the stream-independence model by allowing a CUDA program to explicitly associate managed allocations with a CUDA stream. In this way, the programmer indicates the use of data by kernels based on whether they are launched into a specified stream or not. This enables opportunities for concurrency based on program-specific data access patterns. The function to control this behavior is:

cudaError_t cudaStreamAttachMemAsync(cudaStream_t stream,
                                     void *ptr,
                                     size_t length=0,
                                     unsigned int flags=0);

The cudaStreamAttachMemAsync() function associates length bytes of memory starting from ptr with the specified stream. (Currently, length must always be 0 to indicate that the entire region should be attached.) Because of this association, the Unified Memory system allows CPU access to this memory region so long as all operations in stream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.

Most importantly, if an allocation is not associated with a specific stream, it is visible to all running kernels regardless of their stream. This is the default visibility for a cudaMallocManaged() allocation or a __managed__ variable; hence, the simple-case rule that the CPU may not touch the data while any kernel is running.

By associating an allocation with a specific stream, the program makes a guarantee that only kernels launched into that stream will touch that data. No error checking is performed by the Unified Memory system: it is the programmer’s responsibility to ensure that guarantee is honored.

In addition to allowing greater concurrency, the use of cudaStreamAttachMemAsync() can (and typically does) enable data transfer optimizations within the Unified Memory system that may affect latencies and other overhead.

19.3.2.4.4. Stream Association Examples

Associating data with a stream allows fine-grained control over CPU + GPU concurrency, but what data is visible to which streams must be kept in mind when using devices of compute capability lower than 6.0. Looking at the earlier synchronization example:

__device__ __managed__ int x, y=2;
__global__  void  kernel() {
    x = 10;
}
int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &y, 0, cudaMemAttachHost);
    cudaDeviceSynchronize();          // Wait for Host attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>(); // Note: Launches into stream1.
    y = 20;                           // Success - a kernel is running but "y"
                                      // has been associated with no stream.
    return  0;
}

Here we explicitly associate y with host accessibility, thus enabling access at all times from the CPU. (As before, note the absence of cudaDeviceSynchronize() before the access.) Accesses to y by the GPU running kernel will now produce undefined results.

Note that associating a variable with a stream does not change the association of any other variable. For example, associating x with stream1 does not ensure that only x is accessed by kernels launched in stream1; thus, this code causes an error:

__device__ __managed__ int x, y=2;
__global__  void  kernel() {
    x = 10;
}
int main() {
    cudaStream_t stream1;
    cudaStreamCreate(&stream1);
    cudaStreamAttachMemAsync(stream1, &x);// Associate "x" with stream1.
    cudaDeviceSynchronize();              // Wait for "x" attachment to occur.
    kernel<<< 1, 1, 0, stream1 >>>();     // Note: Launches into stream1.
    y = 20;                               // ERROR: "y" is still associated globally
                                          // with all streams by default
    return  0;
}

Note how the access to y will cause an error because, even though x has been associated with a stream, we have told the system nothing about who can see y. The system therefore conservatively assumes that kernel might access it and prevents the CPU from doing so.

19.3.2.4.5. Stream Attach With Multithreaded Host Programs

The primary use for cudaStreamAttachMemAsync() is to enable independent task parallelism using CPU threads. Typically in such a program, a CPU thread creates its own stream for all work that it generates because using CUDA’s NULL stream would cause dependencies between threads.

The default global visibility of managed data to any GPU stream can make it difficult to avoid interactions between CPU threads in a multi-threaded program. Function cudaStreamAttachMemAsync() is therefore used to associate a thread’s managed allocations with that thread’s own stream, and the association is typically not changed for the life of the thread.

Such a program would simply add a single call to cudaStreamAttachMemAsync() to use unified memory for its data accesses:

// This function performs some task, in its own private stream.
void run_task(int *in, int *out, int length) {
    // Create a stream for us to use.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Allocate some managed data and associate with our stream.
    // Note the use of the host-attach flag to cudaMallocManaged();
    // we then associate the allocation with our stream so that
    // our GPU kernel launches can access it.
    int *data;
    cudaMallocManaged((void **)&data, length, cudaMemAttachHost);
    cudaStreamAttachMemAsync(stream, data);
    cudaStreamSynchronize(stream);
    // Iterate on the data in some way, using both Host & Device.
    const int N = 4;                   // Iteration count (illustrative value).
    for(int i = 0; i < N; i++) {
        transform<<< 100, 256, 0, stream >>>(in, data, length);
        cudaStreamSynchronize(stream);
        host_process(data, length);    // CPU uses managed data.
        convert<<< 100, 256, 0, stream >>>(out, data, length);
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(data);
}

In this example, the allocation-stream association is established just once, and then data is used repeatedly by both the host and the device. The result is much simpler code than explicitly copying data between host and device, although the behavior is the same.

19.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints

In the previous example cudaMallocManaged() specifies the cudaMemAttachHost flag, which creates an allocation that is initially invisible to device-side execution. (The default allocation would be visible to all GPU kernels on all streams.) This ensures that there is no accidental interaction with another thread’s execution in the interval between the data allocation and when the data is acquired for a specific stream.

Without this flag, a new allocation would be considered in-use on the GPU if a kernel launched by another thread happens to be running. This might impact the thread’s ability to access the newly allocated data from the CPU (for example, within a base-class constructor) before it is able to explicitly attach it to a private stream. To enable safe independence between threads, therefore, allocations should be made specifying this flag.

Note

An alternative would be to place a process-wide barrier across all threads after the allocation has been attached to the stream. This would ensure that all threads complete their data/stream associations before any kernels are launched, avoiding the hazard. A second barrier would be needed before the stream is destroyed because stream destruction causes allocations to revert to their default visibility. The cudaMemAttachHost flag exists both to simplify this process, and because it is not always possible to insert global barriers where required.

19.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory

See Memcpy()/Memset() Behavior With Unified Memory for a general overview of cudaMemcpy* / cudaMemset* behavior on devices with concurrentManagedAccess set. On devices where concurrentManagedAccess is not set, the following rules apply:

If cudaMemcpyHostTo* is specified and the source data is unified memory, then it will be accessed from the host if it is coherently accessible from the host in the copy stream (1); otherwise it will be accessed from the device. Similar rules apply to the destination when cudaMemcpy*ToHost is specified and the destination is unified memory.

If cudaMemcpyDeviceTo* is specified and the source data is unified memory, then it will be accessed from the device. The source must be coherently accessible from the device in the copy stream (2); otherwise, an error is returned. Similar rules apply to the destination when cudaMemcpy*ToDevice is specified and the destination is unified memory.

If cudaMemcpyDefault is specified, then unified memory will be accessed from the host either if it cannot be coherently accessed from the device in the copy stream (2) or if the preferred location for the data is cudaCpuDeviceId and it can be coherently accessed from the host in the copy stream (1); otherwise, it will be accessed from the device.

When using cudaMemset*() with unified memory, the data must be coherently accessible from the device in the stream being used for the cudaMemset() operation (2); otherwise, an error is returned.

When data is accessed from the device either by cudaMemcpy* or cudaMemset*, the stream of operation is considered to be active on the GPU. During this time, any CPU access of data that is associated with that stream or data that has global visibility, will result in a segmentation fault if the GPU has a zero value for the device attribute concurrentManagedAccess. The program must synchronize appropriately to ensure the operation has completed before accessing any associated data from the CPU.

  1. Coherently accessible from the host in a given stream means that the memory neither has global visibility nor is it associated with the given stream.

  2. Coherently accessible from the device in a given stream means that the memory either has global visibility or is associated with the given stream.
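
A short sketch of the last rule above (assumptions: a single stream and a device where concurrentManagedAccess is 0):

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    int *data;
    cudaMallocManaged(&data, 100 * sizeof(int));
    cudaStreamAttachMemAsync(stream, data);   // Associate "data" with stream.
    cudaStreamSynchronize(stream);            // Wait for attachment to occur.
    cudaMemsetAsync(data, 0, 100 * sizeof(int), stream);
    // While the memset is in flight, "stream" is active on the GPU, so a
    // CPU access of "data" here would cause a segmentation fault.
    cudaStreamSynchronize(stream);            // Ensure the memset completed.
    data[0] = 1;                              // Safe: stream is idle.
    cudaFree(data);
    cudaStreamDestroy(stream);
    return 0;
}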

20. Lazy Loading

20.1. What is Lazy Loading?

Lazy Loading delays the loading of CUDA modules and kernels from program initialization to closer to kernel execution. If a program does not use every kernel it has included, then some kernels will be loaded unnecessarily. This is very common, especially if you include any libraries. Most of the time, programs only use a small number of kernels from the libraries they include.

Thanks to Lazy Loading, programs load only the kernels they are actually going to use, saving time on initialization. This reduces memory overhead, both in GPU memory and in host memory.

Lazy Loading is enabled by setting the CUDA_MODULE_LOADING environment variable to LAZY.
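
For example, from a POSIX shell (the program name here is a placeholder):

# Enable Lazy Loading for a single run.
CUDA_MODULE_LOADING=LAZY ./my_cuda_app

# Or export it for the whole session.
export CUDA_MODULE_LOADING=LAZY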

Firstly, the CUDA Runtime will no longer load all modules during program initialization, with the exception of modules containing managed variables. Each module will be loaded on first usage of a variable or a kernel from that module. This optimization is only relevant to CUDA Runtime users; CUDA Driver users who use cuModuleLoad are unaffected. This optimization shipped in CUDA 11.8. The behavior for CUDA Driver users who use cuLibraryLoad to load module data into memory can be changed by setting the CUDA_MODULE_DATA_LOADING environment variable.

Secondly, loading a module (the cuModuleLoad*() family of functions) will not load its kernels immediately; instead, loading of a kernel is delayed until cuModuleGetFunction() is called. There are certain exceptions: some kernels have to be loaded during cuModuleLoad*(), such as kernels whose pointers are stored in global variables. This optimization is relevant to both CUDA Runtime and CUDA Driver users. The CUDA Runtime will only call cuModuleGetFunction() when a kernel is used or referenced for the first time. This optimization shipped in CUDA 11.7.

Both of these optimizations are designed to be invisible to the user, assuming the CUDA Programming Model is followed.

20.2. Lazy Loading version support

Lazy Loading is a CUDA Runtime and CUDA Driver feature. Upgrades to both might be necessary to utilize the feature.

20.2.1. Driver

Lazy Loading requires an R515+ user-mode library, but it supports Forward Compatibility, meaning it can run on top of older kernel-mode drivers.

Without an R515+ user-mode library, Lazy Loading is not available in any shape or form, even if the toolkit version is 11.7+.

20.2.2. Toolkit

Lazy Loading was introduced in CUDA 11.7, and received a significant upgrade in CUDA 11.8.

If your application uses the CUDA Runtime, then in order to see benefits from Lazy Loading it must use an 11.7+ CUDA Runtime.

As the CUDA Runtime is usually linked statically into programs and libraries, this means that you have to recompile your program with the CUDA 11.7+ toolkit and use CUDA 11.7+ libraries.

Otherwise you will not see the benefits of Lazy Loading, even if your driver version supports it.

If only some of your libraries are 11.7+, you will see the benefits of Lazy Loading only in those libraries. The other libraries will still load everything eagerly.

20.2.3. Compiler

Lazy Loading does not require any compiler support. Both SASS and PTX compiled with pre-11.7 compilers can be loaded with Lazy Loading enabled and will see the full benefits of the feature. However, an 11.7+ CUDA Runtime is still required, as described above.

20.3. Triggering loading of kernels in lazy mode

Loading kernels and variables happens automatically, without any need for explicit loading. Simply launching a kernel, or referencing a variable or a kernel, will automatically load the relevant modules and kernels.

However, if for any reason you wish to load a kernel without executing it or modifying it in any way, we recommend the following.

20.3.1. CUDA Driver API

Loading of kernels happens during the cuModuleGetFunction() call. This call is necessary even without Lazy Loading, as it is the only way to obtain a kernel handle.

However, you can also use this API to control, with finer granularity, when kernels are loaded.
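For example, the handle lookup itself can serve as an explicit preload step. In the sketch below, the module handle and the kernel name are illustrative:

```cpp
// module was obtained earlier from a cuModuleLoad*() call; "myKernel" is
// an illustrative kernel name.
CUfunction func;
// Under Lazy Loading, cuModuleGetFunction() both returns the kernel handle
// and triggers the actual load of that kernel into the context.
CUresult status = cuModuleGetFunction(&func, module, "myKernel");
assert(status == CUDA_SUCCESS);
// The kernel is now loaded, but has not been executed or modified.
```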

20.3.2. CUDA Runtime API

The CUDA Runtime API handles module management automatically, so we recommend simply using cudaFuncGetAttributes() to reference the kernel.

This will ensure that the kernel is loaded without changing the state.
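A minimal sketch of this approach, where the kernel symbol and wrapper function are illustrative:

```cpp
__global__ void myKernel(int *out);  // defined elsewhere in the program

void preloadMyKernel() {
    cudaFuncAttributes attr;
    // Referencing the kernel through cudaFuncGetAttributes() forces the
    // module containing it to load under Lazy Loading, without launching it.
    cudaError_t status = cudaFuncGetAttributes(&attr, myKernel);
    assert(status == cudaSuccess);
}
```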

20.4. Querying whether Lazy Loading is turned on

In order to check whether the user has enabled Lazy Loading, CUresult cuModuleGetLoadingMode ( CUmoduleLoadingMode* mode ) can be used.

It’s important to note that CUDA must be initialized before calling this function. Sample usage can be seen in the snippet below.

#include <cuda.h>
#include <cassert>
#include <iostream>

int main() {
        CUmoduleLoadingMode mode;

        assert(CUDA_SUCCESS == cuInit(0));
        assert(CUDA_SUCCESS == cuModuleGetLoadingMode(&mode));

        std::cout << "CUDA Module Loading Mode is " << ((mode == CU_MODULE_LAZY_LOADING) ? "lazy" : "eager") << std::endl;

        return 0;
}

20.5. Possible issues when adopting lazy loading

Lazy Loading is designed so that it should not require any modifications to applications. That said, there are some caveats, especially when applications are not fully compliant with the CUDA Programming Model.

20.5.1. Concurrent execution

Loading kernels might require context synchronization. Some programs incorrectly treat the possibility of concurrent kernel execution as a guarantee. In such cases, if the program assumes that two kernels will be able to execute concurrently, and one of the kernels will not return until the other executes, there is a possibility of a deadlock.

For example, suppose kernel A spins in an infinite loop until kernel B executes. In that case, launching kernel B will trigger lazy loading of kernel B. If this loading requires context synchronization, we have a deadlock: kernel A is waiting for kernel B, but loading kernel B is stuck waiting for kernel A to finish so that the context can synchronize.

Such a program is an anti-pattern, but if for any reason you want to keep it, you can do the following:

  • preload all kernels that you hope to execute concurrently prior to launching them

  • run the application with CUDA_MODULE_DATA_LOADING=EAGER to force loading data eagerly without forcing each function to load eagerly
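The first workaround can be sketched with the runtime API as follows; the kernel names, launch configurations, and streams are illustrative:

```cpp
cudaFuncAttributes attr;
// Preload both kernels so that neither launch triggers lazy loading
// (and a possible context synchronization) while the other is running.
cudaFuncGetAttributes(&attr, kernelA);
cudaFuncGetAttributes(&attr, kernelB);

// Both kernels are now loaded; the concurrent launches below can no
// longer deadlock on module loading.
kernelA<<<grid, block, 0, streamA>>>();
kernelB<<<grid, block, 0, streamB>>>();
```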

20.5.2. Allocators

Lazy Loading delays loading code from the initialization phase of the program to closer to the execution phase. Loading code onto the GPU requires memory allocation.

If your application tries to allocate the entire VRAM on startup, e.g., to use it for its own allocator, then it might turn out that there is no memory left to load the kernels, despite the fact that overall Lazy Loading frees up more memory for the user. CUDA needs to allocate some memory to load each kernel, which usually happens at the first launch of that kernel. If your application's allocator has greedily allocated everything, CUDA will fail to allocate that memory.

Possible solutions:

  • use cudaMallocAsync() instead of an allocator that allocates the entire VRAM on startup

  • add some buffer to compensate for the delayed loading of kernels

  • preload all kernels that will be used in the program before trying to initialize your allocator

20.5.3. Autotuning

Some applications launch several kernels implementing the same functionality to determine which one is fastest. While it is generally advisable to run at least one warmup iteration, doing so becomes especially important with Lazy Loading: otherwise, the time taken to load the kernel will skew your results.

Possible solutions:

  • do at least one warmup iteration prior to measurement

  • preload the benchmarked kernel prior to launching it
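A timing loop following these recommendations might look like this; the kernel, launch configuration, and iteration count are illustrative:

```cpp
// Warmup launch: absorbs the one-time lazy-loading cost (and other
// first-launch overheads) so it does not pollute the measurement.
candidate<<<grid, block>>>();
cudaDeviceSynchronize();

// Measured launches, timed with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
for (int i = 0; i < kIterations; ++i)
    candidate<<<grid, block>>>();
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
```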

21. Extended GPU Memory

The Extended GPU Memory (EGM) feature, utilizing high-bandwidth NVLink-C2C, facilitates efficient access by GPUs to all system memory in a single-node system. EGM applies to integrated CPU-GPU NVIDIA systems by allowing physical memory allocations that can be accessed from any GPU thread within the setup. EGM ensures that all GPUs can access these resources at the speed of either GPU-GPU NVLink or NVLink-C2C.


In this setup, local memory accesses occur via the high-bandwidth NVLink-C2C. For remote memory accesses, GPU NVLink and, in some cases, NVLink-C2C are used. With EGM, GPU threads gain the capability to access all available memory resources, including CPU-attached memory and HBM3, over the NVSwitch fabric.

21.1. Preliminaries

Before diving into API changes for EGM functionalities, we are going to cover currently supported topologies, identifier assignment, prerequisites for virtual memory management, and CUDA types for EGM.

21.1.1. EGM Platforms: System topology

Currently, EGM can be enabled on three platforms: (1) Single-Node, Single-GPU: consists of an Arm-based CPU, CPU-attached memory, and a GPU, with a high-bandwidth C2C (Chip-to-Chip) interconnect between the CPU and the GPU. (2) Single-Node, Multi-GPU: consists of four fully connected single-node, single-GPU platforms. (3) Multi-Node, Single-GPU: two or more single-node, multi-socket systems.

Note

Using cgroups to limit available devices will block routing over EGM and cause performance issues. Use CUDA_VISIBLE_DEVICES instead.

21.1.2. Socket Identifiers: What are they? How to access them?

NUMA (Non-Uniform Memory Access) is a memory architecture used in multi-processor computer systems in which the memory is divided into multiple nodes, each with its own processors and memory. In such a system, NUMA assigns a unique identifier (numaID) to every node.

EGM uses the NUMA node identifier assigned by the operating system. Note that this identifier is different from the ordinal of a device; it is associated with the closest host node. In addition to the existing methods, the user can obtain the identifier of the host node (numaID) by calling cuDeviceGetAttribute with the CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID attribute type, as follows:

int numaId;
cuDeviceGetAttribute(&numaId, CU_DEVICE_ATTRIBUTE_HOST_NUMA_ID, deviceOrdinal);

21.1.3. Allocators and EGM support

Mapping system memory as EGM does not cause any performance issues. In fact, accessing a remote socket's system memory mapped as EGM will be faster, because with EGM the traffic is guaranteed to be routed over NVLink. Currently, the cuMemCreate and cudaMemPoolCreate allocators are supported, with the appropriate location type and NUMA identifiers.

21.1.4. Memory management extensions to current APIs

Currently, EGM memory can be mapped with the Virtual Memory (cuMemCreate) or Stream Ordered Memory (cudaMemPoolCreate) allocators. The user is responsible for allocating physical memory and mapping it to a virtual memory address space on all sockets.

Note

Multi-node, single-GPU platforms require interprocess communication. Therefore, we encourage the reader to see Chapter 3.

Note

We encourage readers to read CUDA Programming Guide’s Chapter 10 and Chapter 11 for a better understanding.

New CUDA property types have been added to the APIs to allow these approaches to specify allocation locations using NUMA-like node identifiers:

CUDA Type                        Used with

CU_MEM_LOCATION_TYPE_HOST_NUMA   CUmemAllocationProp for cuMemCreate

cudaMemLocationTypeHostNuma      cudaMemPoolProps for cudaMemPoolCreate

Note

Please see the CUDA Driver API and CUDA Runtime Data Types documentation to find out more about NUMA-specific CUDA types.

21.2. Using the EGM Interface

21.2.1. Single-Node, Single-GPU

Any of the existing CUDA host allocators, as well as system-allocated memory, can be used to benefit from the high-bandwidth C2C. To the user, local access behaves the same as a host allocation does today.

Note

Refer to the tuning guide for more information about memory allocators and page sizes.

21.2.2. Single-Node, Multi-GPU

In a multi-GPU system, the user has to provide host information for the placement. As mentioned, a natural way to express that information is by using NUMA node IDs, and EGM follows this approach. Using the cuDeviceGetAttribute function, the user can learn the closest NUMA node ID (see Socket Identifiers: What are they? How to access them?). The user can then allocate and manage EGM memory using the VMM (Virtual Memory Management) API or a CUDA Memory Pool.

21.2.2.1. Using VMM APIs

The first step in memory allocation using the Virtual Memory Management APIs is to create a physical memory chunk that will provide a backing for the allocation. See the CUDA Programming Guide’s Virtual Memory Management section for more details. For EGM allocations, the user has to explicitly provide CU_MEM_LOCATION_TYPE_HOST_NUMA as the location type and numaID as the location identifier. Also, EGM allocations must be aligned to the appropriate granularity of the platform. The following code snippet shows allocating physical memory with cuMemCreate:

CUmemAllocationProp prop{};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
prop.location.id = numaId;
size_t granularity = 0;
cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t padded_size = ROUND_UP(size, granularity);
CUmemGenericAllocationHandle allocHandle;
cuMemCreate(&allocHandle, padded_size, &prop, 0);

After physical memory allocation, we have to reserve an address space and map it to a pointer. These procedures do not have EGM-specific changes:

CUdeviceptr dptr;
cuMemAddressReserve(&dptr, padded_size, 0, 0, 0);
cuMemMap(dptr, padded_size, 0, allocHandle, 0);

Finally, the user has to explicitly set access rights on the mapped virtual address ranges; otherwise, accessing the mapped space would result in a crash. As with the memory allocation, the user has to provide CU_MEM_LOCATION_TYPE_HOST_NUMA as the location type and numaId as the location identifier. The following code snippet creates access descriptors for the host node and the GPU, giving both of them read and write access to the mapped memory:

CUmemAccessDesc accessDesc[2]{{}};
accessDesc[0].location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
accessDesc[0].location.id = numaId;
accessDesc[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
accessDesc[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc[1].location.id = currentDev;
accessDesc[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
cuMemSetAccess(dptr, padded_size, accessDesc, 2);

21.2.2.2. Using CUDA Memory Pool

To use EGM, the user can create a memory pool on a node and give access to peers. In this case, the user has to explicitly provide cudaMemLocationTypeHostNuma as the location type and numaId as the location identifier. The following code snippet shows creating a memory pool with cudaMemPoolCreate:

cudaSetDevice(homeDevice);
cudaMemPoolProps props{};
props.allocType = cudaMemAllocationTypePinned;
props.location.type = cudaMemLocationTypeHostNuma;
props.location.id = numaId;
cudaMemPool_t memPool;
cudaMemPoolCreate(&memPool, &props);

Additionally, for direct connect peer access, it is also possible to use the existing peer access API, cudaMemPoolSetAccess. An example for an accessingDevice is shown in the following code snippet:

cudaMemAccessDesc desc{};
desc.flags = cudaMemAccessFlagsProtReadWrite;
desc.location.type = cudaMemLocationTypeDevice;
desc.location.id = accessingDevice;
cudaMemPoolSetAccess(memPool, &desc, 1);

Once the memory pool is created and accesses are granted, the user can set the created memory pool on the residentDevice and start allocating memory using cudaMallocAsync:

cudaDeviceSetMemPool(residentDevice, memPool);
cudaMallocAsync(&ptr, size, memPool, stream);

Note

EGM is mapped with 2MB pages. Therefore, users may encounter more TLB misses when accessing very large allocations.

21.2.3. Multi-Node, Single-GPU

Beyond memory allocation, remote peer access does not have EGM-specific modifications; it follows the CUDA interprocess communication (IPC) protocol. See the CUDA Programming Guide for more details on IPC.

The user should allocate memory using cuMemCreate, again explicitly providing CU_MEM_LOCATION_TYPE_HOST_NUMA as the location type and numaID as the location identifier. In addition, CU_MEM_HANDLE_TYPE_FABRIC should be specified as the requested handle type. The following code snippet shows allocating physical memory on Node A:

CUmemAllocationProp prop{};
prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;
prop.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
prop.location.id = numaId;
size_t granularity = 0;
cuMemGetAllocationGranularity(&granularity, &prop,
                              CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t padded_size = ROUND_UP(size, granularity);
size_t page_size = ...;
assert(padded_size % page_size == 0);
CUmemGenericAllocationHandle allocHandle;
cuMemCreate(&allocHandle, padded_size, &prop, 0);

After creating the allocation handle using cuMemCreate, the user can export that handle to the other node, Node B, by calling cuMemExportToShareableHandle:

cuMemExportToShareableHandle(&fabricHandle, allocHandle,
                             CU_MEM_HANDLE_TYPE_FABRIC, 0);
// At this point, fabricHandle should be sent to Node B via TCP/IP.

On Node B, the handle can be imported using cuMemImportFromShareableHandle and treated like any other fabric handle:

// At this point, fabricHandle should be received from Node A via TCP/IP.
CUmemGenericAllocationHandle allocHandle;
cuMemImportFromShareableHandle(&allocHandle, &fabricHandle,
                               CU_MEM_HANDLE_TYPE_FABRIC);

Once the handle is imported on Node B, the user can reserve an address space and map it locally in the regular fashion:

size_t granularity = 0;
cuMemGetAllocationGranularity(&granularity, &prop,
                              CU_MEM_ALLOC_GRANULARITY_MINIMUM);
size_t padded_size = ROUND_UP(size, granularity);
size_t page_size = ...;
assert(padded_size % page_size == 0);
CUdeviceptr dptr;
cuMemAddressReserve(&dptr, padded_size, 0, 0, 0);
cuMemMap(dptr, padded_size, 0, allocHandle, 0);

As the final step, the user should grant appropriate access to each of the local GPUs on Node B. The following example code snippet gives read and write access to eight local GPUs:

// Give all 8 local GPUs access to the exported EGM memory located on Node A.
CUmemAccessDesc accessDesc[8];
for (int i = 0; i < 8; i++) {
   accessDesc[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
   accessDesc[i].location.id = i;
   accessDesc[i].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
}
cuMemSetAccess(dptr, padded_size, accessDesc, 8);

22. Notices

22.1. Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

22.2. OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

22.3. Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.